In This Issue


This issue of Survey Methodology contains articles on a variety of topics. Kott, Amrhein and
Hicks tackle the problem of multi-purpose surveys. For such surveys, it would be desirable to be able 
to stratify the target population in various ways in order to improve the precision of the estimates of 
interest. The authors present four sampling methods for the selection of samples through various 
stratifications while reducing the overall size of the sample. These strategies are then evaluated using 
data taken from an agriculture survey. The authors then show how a calibration estimator can 
improve the relative efficiency. 

Singh, Horn and Yu examine the problem of estimating the variance of the general linear regression 
estimator. They carry out calibration at two distinct levels. The higher-level calibration thus defined uses 
the known total and variance of the auxiliary variables. The authors show that this method covers a 
broader range of estimators than the lower-level calibration method, which uses only the known total of 
the auxiliary variables. An empirical study is presented to assess the efficiency of the proposed strategies. 

Hidiroglou and Särndal concern themselves with the use of auxiliary data in two-phase sampling.
They explain how these data are converted into calibration weights, in two phases, in order to create
efficient estimators of a population total. The authors show that the calibration estimator, using the 
generalized least squares function, can be expressed as a perfectly equivalent two-phase regression 
estimator, that is, an estimator that is the product of two successive regression fits. They examine forms 
of the two-phase calibration estimator when the auxiliary data are for population subsets known as 
“calibration groups.” They also discuss the estimation of domains of interest and the estimation of 
variance. 

Byczkowski, Levy and Sweeney consider survey frames having a many-to-many structure, that is, any 
unit in the frame may be associated with multiple target population elements and any target population 
element may be associated with multiple frame units. This problem is motivated by a building 
characteristics survey in which the target population consists of commercial buildings, but the frame 
consists of a list of street addresses (which in turn correspond to either single buildings, multiple 
buildings or parts of buildings). Under this setting, estimators of totals and means and their variances 
using simple and stratified random sampling without replacement are developed. 

Yansaneh and Fuller present a recursive regression estimation procedure to reduce the computational 
complexity associated with best linear unbiased estimation in the context of a repeated survey with partial 
overlap. They use data from the U.S. Current Population Survey (CPS) to compare variances of their 
recursive regression estimator to some alternative estimators including the current CPS composite 
estimator. The proposed estimator seems to be very competitive for estimates of both level and change. 
They also estimate variances under various rotation patterns and find that the current 4-8-4 rotation 
pattern is superior to continuous rotation for current level and long-period averages, but inferior for short 
period changes. 

Lehtonen and Veijanen bring together two well-known ideas, generalized regression (GREG) and 
pseudo maximum likelihood estimation, to develop a new methodology for estimating the population total 
of a categorical survey variable, given a vector of known auxiliary variables. The values of the categorical 
variable are modeled as realizations from a multinomial logistic and the corresponding unknown 
parameters are estimated through pseudo maximum likelihood. Then, the population frequencies of 
interest are estimated via a modified GREG estimator which uses these estimated parameters. Variance 
estimates of the frequencies are given through Taylor linearization, and some empirical results based on 
Finnish Labour Force Survey data are provided. 

Casady, Dorfman and Wang consider the construction of confidence intervals for domain parameters 
in the case where the domain sample size is not fixed by the design. They condition on the observed 
domain sample size and show how, under certain assumptions about the population, conditional t-based 
confidence intervals can be obtained. In an empirical study using data from the U.S. Bureau of Labor 
Statistics Occupational Compensation survey, they demonstrate that the proposed conditional intervals 
have better coverage probabilities than standard marginal intervals. 




Montanari compares two well-known estimators of a finite population mean: the GREG and the 
design-optimal regression estimator obtained from the difference estimator. While the former can be 
inefficient if the underlying model is misspecified, the latter, although model-free, is vulnerable to 
sampling fluctuations. An efficiency measure, which provides a criterion for choosing between the two 
estimators, is given. The results of an empirical study, which investigates the behaviour of both estimators 
under a variety of misspecified and correct models, are discussed. 

Haines and Pollock provide a fresh examination of estimating totals with multiple frames. Estimators 
are developed when information is only available from list frames and, in addition, when information is 
also provided from an area frame. A simulation shows that the best estimator depends on the known, or 
assumed, dependence of the frames. They also study the situation when observations are either available 
for all units or only available for a sub-sample from each frame. Again, the preferred estimator changes 
when the dependence between frames is considered. 

Bates and Gerber analyze the dynamics of a difficult problem: how temporary mobility of an 
individual contributes to within-household coverage error. They develop a two dimensional typology to 
characterize temporary mobility, then using data from the Living Situation Survey, conducted in the U.S. 
in 1993, they identify four temporary mobility patterns. Two of these traits are found to be useful 
predictors of persons missed from censuses or surveys. 


The Editor 


Survey Methodology, June 1998 
Vol. 24, No. 1, pp. 3-9 
Statistics Canada 


Sampling and Estimation From Multiple List Frames 


PHILLIP S. KOTT, JOHN F. AMRHEIN and SUSAN D. HICKS¹


ABSTRACT 


Many economic and agricultural surveys are multi-purpose. It would be convenient if one could stratify the target 
population of such a survey in a number of different ways to satisfy a number of different purposes and then combine the 
samples for enumeration. We explore four different sampling methods that select similar samples across all stratifications 
thereby reducing the overall sample size. Data from an agriculture survey is used to evaluate the effectiveness of these 
alternative sampling strategies. We then show how a calibration (i.e., reweighted) estimator can increase statistical efficiency 
by capturing what is known about the original stratum sizes in the estimation. Raking, which has been suggested in the
literature for this purpose, is simply one method of calibration.


KEY WORDS: Calibration; Collocated sampling; Permanent random numbers; Poisson sampling; Systematic probability
proportional to size sampling.


1. INTRODUCTION 


Many of the list frame surveys conducted by the National 
Agricultural Statistics Service (NASS) are integrated in the 
sense that data on a range of heterogeneous items, such as
planted crop acres and grain stock inventories, are collected 
in a single survey rather than through a number of indepen- 
dent surveys. Bankier (1986), Skinner (1991), and Skinner, 
Holmes and Holt (1994) have shown how an old method of 
combining independently drawn stratified simple random 
samples — where each sample comes from a (list) frame 
with a different stratification scheme — can be made more 
efficient; that is, the variances resulting from such a 
combined estimation strategy would not be as large as those 
from the independent surveys summarized by themselves. 

Even more appealing for many applications would be a 
sampling design that tends to select the same units from 
every frame, thereby reducing both the cost and respondent 
burden of an integrated survey. This paper explores several 
such designs. Three make use of permanent random 
numbers. The fourth uses a variation of systematic proba- 
bility proportional to size sampling. The goal for each is to 
meet or exceed — at least on average — a particular set of 
sample size targets. 

The paper then shows how a calibration (i.e., reweighted)
estimator can improve relative efficiency by capturing what
we know about the original stratum sizes in the estimation.
A final section points out that the use of a calibration tech- 
nique can do more than simply reflect original stratum sizes. 

An alternative strategy for burden reduction is to use 
separate instruments for different survey targets and to 
select distinct samples for each instrument. This increases 
the number of units selected overall, but reduces the burden
per selected unit. NASS is using that approach in its 
Agricultural Resources Management Study (see Kott and 
Fetter 1997), but it is not the approach to be discussed here. 


2. INDEPENDENT SAMPLING AND UNBIASED 
ESTIMATION 


Suppose we have F independent frames; for example, a 
sorghum frame, an oats frame, and a general grain stocks 
frame. Each frame is stratified independently, and without 
replacement simple random samples are drawn from each 
stratum of every frame. Frame $f$ (say, the oats frame)
contains $H_f$ strata; stratum $h$ (large oats operations) in
frame $f$ has $N_{fh}$ population units, out of which $n_{fh}$ units are
selected. The union of the F frames must cover the entire 
(list) population, but no single frame need be complete. 
The frames may overlap. 

One unbiased estimator for a population total $T = \sum_{i \in P} y_i$
is the simple multiplicity estimator suggested by Skinner
(1991):

$$t_M = \sum_{i \in P} n_{(i)} y_i / E[n_{(i)}], \qquad (1)$$

where $P$ denotes the entire population, and $n_{(i)}$ is the number
of times unit $i$ is selected for the sample from any frame.
Observe that $n_{(i)} = 0$ for the population units not in the
sample. In the great majority of applications, $n_{(i)}$ will be
one for most sampled units, but $n_{(i)} > 1$ is a possibility with
this design.

The expected number of times unit $i$ will be selected for
the sample is $E[n_{(i)}] = \sum_f p_{if}$, where $p_{if}$ is the probability of
selecting unit $i$ in the stratified simple random sample from
frame $f$; that is, $p_{if} = n_{fh}/N_{fh}$, where unit $i$ is in stratum $h$ of
frame $f$.

There is also a Horvitz-Thompson estimator for $T$ under
the design, namely $t_{HT} = \sum_{i \in S} y_i/\pi_i$, where $S$ denotes the
sample and $\pi_i = 1 - (1 - p_{i1})(1 - p_{i2}) \cdots (1 - p_{iF})$. See
Bankier (1986) for further discussion of this approach.
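
The two estimators of this section can be illustrated with a minimal Python sketch. It assumes that the per-frame selection probabilities of each unit are stored in a list, with 0 entered for any frame that does not contain the unit; the function names and the data layout are hypothetical and are not part of the paper.

```python
from math import prod


def expected_selections(p_i):
    """Expected number of times a unit is drawn: the sum of its per-frame probabilities."""
    return sum(p_i)


def inclusion_prob(p_i):
    """Probability of being drawn from at least one frame under independent sampling."""
    return 1.0 - prod(1.0 - p for p in p_i)


def multiplicity_estimate(y, counts, p):
    """Multiplicity estimator: each sampled unit contributes count * y / E[count]."""
    return sum(c * yi / expected_selections(pi)
               for yi, c, pi in zip(y, counts, p) if c > 0)


def horvitz_thompson_estimate(y, counts, p):
    """Horvitz-Thompson estimator: each distinct sampled unit contributes y / pi."""
    return sum(yi / inclusion_prob(pi)
               for yi, c, pi in zip(y, counts, p) if c > 0)


if __name__ == "__main__":
    # Two units, two frames; unit 0 was drawn from both frames, unit 1 from neither.
    y = [10.0, 20.0]
    counts = [2, 0]
    p = [[0.5, 0.5], [0.25, 0.0]]
    print(multiplicity_estimate(y, counts, p))
    print(horvitz_thompson_estimate(y, counts, p))
```

The multiplicity estimator counts a unit once per selection, whereas the Horvitz-Thompson form counts each distinct unit once and divides by the probability of at least one selection.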


¹ Phillip S. Kott, Research Division; John F. Amrhein, Survey Sampling Branch; and Susan D. Hicks, Estimates Division, National Agricultural Statistics
Service, USDA.


3. SAMPLING STRATEGIES USING 
PERMANENT RANDOM NUMBERS 


The sampling design discussed above is independent 
across frames. For many surveys, however, it would be 
convenient if the design were not independent across 
frames. This is because all units in the combined sample 
are given the same survey instrument, and many units are in 
a number of frames. Therefore, given frame/stratum
sample-size targets, a design with a tendency towards selecting the
same unit in every frame should result in a smaller overall 
number of contacts (and consequently survey costs) than 
independent sampling across frames. 

To this end, suppose each unit has been given a target $p_{if}$
in each frame to meet or exceed. This target value is
constant for all units in stratum $h$ of frame $f$. We will
withhold judgement on the policy of focussing on target $p_{if}$
values (or equivalently on target $n_{fh}$ values) until the
concluding section. Suffice it to say that many statistical 
agencies, including NASS, have such a policy. 

One potential sampling design assigns each unit in the 
population a permanent random number (PRN) drawn from 
the uniform distribution on the interval [0, 1). Unit $i$ is
selected for the frame $f$ sample when its PRN is less than
$p_{if}$. The result is a Poisson sample where the probability of
selecting unit $i$ for the sample is $\pi_i = \max_f \{p_{if}\}$, which is
clearly at least as large as each individual $p_{if}$ for a given
unit. An unbiased Horvitz-Thompson estimator for $T$
under this design is $t_P = \sum_{i \in S} y_i / \max_f \{p_{if}\}$.

Under Poisson sampling, sample size is random. One
way to reduce the variance of the sample size is with a
variant of this sample design. In collocated PRN sampling,
each population unit is assigned a unique PRN from among
the members of the set
$\{e/N, (1 + e)/N, (2 + e)/N, \ldots, (N - 1 + e)/N\}$,
where $e$ is a uniform random variable drawn
from the interval [0, 1). To this end, one can first draw
provisional PRN's for each unit followed by a value for $e$.
The unit with the smallest provisional PRN is assigned a
collocated PRN of $e/N$, the unit with the second smallest
provisional PRN is assigned $(1 + e)/N$, and so on until
$(N - 1 + e)/N$ is assigned to the unit with the largest
provisional PRN. The estimator $t_P$ remains unbiased under
collocated sampling.
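
The Poisson and collocated PRN selections described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: `p[i]` is taken to be the list of per-frame target probabilities of unit i (0 where the unit is not in a frame), and all function names are hypothetical.

```python
import random


def poisson_prn_sample(prn, p):
    """Select unit i when its PRN is below the largest of its per-frame targets."""
    return [i for i, u in enumerate(prn) if u < max(p[i])]


def collocated_prns(provisional, e):
    """Replace provisional PRNs by the equally spaced values (rank + e) / N."""
    n_units = len(provisional)
    order = sorted(range(n_units), key=lambda i: provisional[i])
    out = [0.0] * n_units
    for rank, i in enumerate(order):
        out[i] = (rank + e) / n_units
    return out


def poisson_estimate(y, sample, p):
    """Horvitz-Thompson estimate with selection probability max over frames."""
    return sum(y[i] / max(p[i]) for i in sample)


if __name__ == "__main__":
    rng = random.Random(1998)
    p = [[0.5, 0.8], [0.05, 0.0], [0.4, 0.2], [0.3, 0.3]]
    provisional = [rng.random() for _ in p]
    collocated = collocated_prns(provisional, rng.random())
    print(poisson_prn_sample(provisional, p))
    print(poisson_prn_sample(collocated, p))
```

Because the collocated PRNs are evenly spread over [0, 1), the number of units falling below any threshold varies less from draw to draw than it does with independent uniform PRNs.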

Due to the random nature of the sample sizes resulting from
Poisson and collocated sampling, frame/stratum sample size
targets may not be met when a particular sample is drawn.
A third PRN design begins with target $n_{fh}$ values and
removes this possibility. In this design, the units in stratum
$h$ of frame $f$ with the $n_{fh}$ smallest PRN's are selected for the
sample (this is very similar to sequential Poisson sampling
in Ohlsson 1995). A Horvitz-Thompson estimator under
this fixed-sample-size PRN design requires one to compute
the selection probabilities of the sampled units, a difficult
task which may have to be approximated by simulation.
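
As a rough illustration of the fixed-sample-size PRN design and of approximating its selection probabilities by simulation, consider the following Python sketch. The dictionary layout (keys are hypothetical frame/stratum labels, values are lists of unit indices) and the function names are assumptions made for this example only.

```python
import random


def fixed_size_prn_sample(prn, strata, targets):
    """Union, over frame/strata, of the units with the smallest PRNs in each stratum."""
    sample = set()
    for key, units in strata.items():
        ranked = sorted(units, key=lambda i: prn[i])
        sample.update(ranked[:targets[key]])
    return sample


def simulate_selection_probs(n_units, strata, targets, reps=10000, seed=0):
    """Relative frequency with which each unit is selected over repeated PRN draws."""
    rng = random.Random(seed)
    hits = [0] * n_units
    for _ in range(reps):
        prn = [rng.random() for _ in range(n_units)]
        for i in fixed_size_prn_sample(prn, strata, targets):
            hits[i] += 1
    return [h / reps for h in hits]


if __name__ == "__main__":
    # Units 1 and 2 sit in a probability stratum of both frames.
    strata = {("oats", 1): [0, 1, 2], ("sorghum", 1): [1, 2, 3]}
    targets = {("oats", 1): 1, ("sorghum", 1): 1}
    print(simulate_selection_probs(4, strata, targets))
```

In this toy layout each stratum is sampled at a rate of 1/3, so a unit in a single stratum is selected with probability 1/3, while a unit in both strata is selected with probability 1/3 + 1/3 - 1/4 = 5/12 (it must have the smallest PRN in at least one stratum).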


4. A SYSTEMATIC PROBABILITY
PROPORTIONAL TO SIZE DESIGN


Another sampling design with the same selection 
probabilities as the Poisson (and collocated) sampling 
scheme described in the last section consists of the follow- 
ing steps: 


0) When necessary, create an additional “stratum” for each
frame consisting of those units not in any design
stratum.

1) Divide up the population into mutually exclusive cells
by cross-classifying the strata from the various frames.
A pair of units in a particular cell will then be in the
same stratum of each frame (e.g., the large oats stratum,
the medium grain stocks stratum, and the no sorghum
stratum).

2) Randomly order the population units in each cell and
then sort the cells themselves in any order. This results
in a list of all population units.

3) Draw a systematic probability proportional to “size”
(PPS) sample from this list using the $\pi_i$ described in the
discussion of Poisson sampling as the measures of size
(the word “size” is in quotes because the $\pi_i$ are not
really size measures in a conventional sense). This
ensures that a unit’s selection probability equals $\pi_i$.


The systematic PPS sampling design introduced above
will always result in a sample of size close to $\sum_{i \in P} \pi_i$. In
fact, if $\sum_{i \in P} \pi_i$ is an integer, then the sample size will
exactly equal that sum. Otherwise, the sample size will be
one of the two integers closest to $\sum_{i \in P} \pi_i$. Similarly, the
expected number of sampled units in a cell, $C$, will be
$\sum_{i \in C} \pi_i$, while the actual sample size will either be $\sum_{i \in C} \pi_i$
or one of the two integers closest to it.
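
The list construction and the systematic draw can be sketched as follows in Python. The sketch assumes that every selection probability is at most 1 and that a single uniform random start in [0, 1) is used with a unit sampling interval; the names `ordered_list` and `systematic_pps` are hypothetical.

```python
import math
import random


def ordered_list(cells, rng):
    """Shuffle the units within each cell, then concatenate cells in label order."""
    units = []
    for label in sorted(cells):
        members = list(cells[label])
        rng.shuffle(members)
        units.extend(members)
    return units


def systematic_pps(probs, start):
    """Select the positions whose cumulated-probability interval contains a point start + j.

    probs : selection probabilities in list order (each assumed <= 1)
    start : random start, uniform on [0, 1)
    """
    sample, cum = [], 0.0
    for k, p in enumerate(probs):
        lo, hi = cum, cum + p
        # Number of integers j >= 0 with lo <= start + j < hi.
        if math.ceil(hi - start) - math.ceil(lo - start) >= 1:
            sample.append(k)
        cum = hi
    return sample


if __name__ == "__main__":
    rng = random.Random(7)
    cells = {"cell A": [0, 1, 2], "cell B": [3, 4]}
    units = ordered_list(cells, rng)
    probs_by_unit = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.25, 4: 0.25}
    picks = systematic_pps([probs_by_unit[u] for u in units], rng.random())
    print([units[k] for k in picks])
```

Because the selection points are exactly one unit apart on the cumulated scale, the number of points falling inside any block of consecutive list positions differs from the block's total probability by less than one, which is the sample size property stated above.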

Consider now a particular stratum $h$ in a particular frame
$f$ with target sample size $n_{fh}$. For a unit $i$ in this stratum,
$\pi_i \ge n_{fh}/N_{fh}$ by design. Let $P(fh)$ denote the set of
population units in stratum $fh$. The expected number of
sampled units in $fh$ is $\sum_{i \in P(fh)} \pi_i \ge n_{fh}$. There is no
guarantee that the realized sample size in the stratum will
be greater than or equal to $n_{fh}$. Nevertheless, given the
above inequality and the lower bounds on the sample sizes
of the cells within $fh$, the sample size in stratum $fh$ will
never be far below $n_{fh}$.

The advantage of this design over Poisson and
collocated sampling is that it produces a more stable sample size
and a greater likelihood of meeting frame/stratum require- 
ments. Fixed-sample-size PRN, by contrast, will always 
meet frame/stratum requirements, but it does so at a cost: 
the design has a less stable overall sample size, and 
selection probabilities can be very difficult to determine. 


5. EVALUATION OF THE ALTERNATIVE
SAMPLING TECHNIQUES


To evaluate these sampling techniques empirically, we 
selected three states that conduct NASS’s Vegetable 
Chemical Use Survey and replicated the three PRN tech- 
niques, the systematic PPS method, and independent 
sampling across frames 100 times. The assigned PRN’s 
were maintained across the three PRN techniques within 
each replicate. A separate frame was constructed for each 
commodity of interest within a state (the number of frames 
ranged from two in Minnesota to 23 in California). Popula- 
tion units were allocated to one of four strata in each frame; 
two probability strata, one take-all stratum, and one zero 
stratum were used in each frame. Stratum boundaries were 
determined using a modified Lavallée and Hidiroglou 
(1988) method, and units were assigned to strata based on 
a cum $\sqrt{f(x)}$ rule (Sweet and Sigman 1995). This
stratification was chosen to mimic what might be a 
reasonable or reasonably common univariate sample design. 

A target sample size of one-third the population was 
selected from each of the probability strata. Table 1 
compares the overall sample sizes realized from each of the 
sampling techniques. As expected, the independent frame 
approach realized the largest sample sizes. The three PRN
techniques realized similar sample sizes, with the
Poisson method showing the highest standard deviation
in each of the three states. The PPS method appears
to be the most stable.


Table 1
Mean realized sample sizes (over 100 replications)

State   Independent    Fixed Sample   Poisson PRN   Collocated    Systematic
        Frame Method   Size Method    Method        PRN Method    PPS Method

CA      496            388            375           374           373
        (8.8)          (9.6)          (11.1)        (5.6)         (.14)

MI      658            [illegible]    504           501           502
        (9.3)          (9.2)          (13.6)        (6.0)         (.48)

NJ      563            359            343           344           343
        (8.1)          (8.6)          (13.8)        (4.6)         (.17)

Population sizes are: CA-775; MI-1041; NJ-785.
Standard deviations are in parentheses.


Table 2 shows the percentage of strata-level Poisson and 
PPS samples that fell short of their target sample sizes. One 
reason more shortfalls were not observed in the Poisson 
methods’ realized sample sizes is the occurrence of what we 
call “visitors”. A visitor is a sample unit that was not chosen 
within a specific commodity’s frame, but ends up in the 
sample because it was selected in another commodity’s 
frame. The existence of visitors tends to cause frame-level
sample sizes to be larger, on average, than the targeted sizes. 


Figure 1 shows cumulative distributions of differences
between realized and desired sample sizes as percents of the
desired sample sizes for the sampled strata; that is, the
cumulative distribution of (realized - desired)/desired at the
probability stratum level. For example, Michigan had 13
commodity frames each with two probability strata. 
Sampling from these frames was replicated 100 times so 
that the cumulative distribution function (CDF) for each 
technique utilized 2600 points. The two Poisson methods 
are shown as a single line since they coincide. The Poisson 
methods do not over-sample as much as the fixed-sample- 
size and independent frame methods, but at the risk of 
under-sampling as we saw in Table 2. The fixed-sample- 
size techniques (with dependent and independent frames) 
do not experience under-sampling, but do experience more 
over-sampling than the Poisson and PPS methods. The PPS 
method experiences some under-sampling but not to the 
extent of the Poisson methods. The PPS design also shows 
the steepest gradient of all the CDF’s, indicating that it 
realizes less over-sampling. 


Table 2
Percentage of probability strata for which the realized
sample size fell short of the target (in 100 replications)

State   Poisson PRN   Collocated    Systematic
        Method        PRN Method    PPS Method

CA      11%           11%           6.3%
MI      12%           12%           6.3%
NJ      11%           8%            1.4%


Under the Poisson and collocated techniques, the
probability of selection for unit $i$ is $\pi_i = \max_f \{p_{if}\}$, where $p_{if}$
is the sampling rate of the stratum $h$ to which $i$ belongs in frame $f$.
The same probability of selection is used for the PPS
technique. By contrast, the probabilities of selection under 
the fixed-sample-size PRN method are difficult to 
determine and may need to be simulated. 

Such a simulation was conducted using the California 
data. The fixed-sample-size technique was run 10,000 
times. Since all probability strata were sampled at a rate of 
1/3, the simulated probabilities (i.e., relative frequencies) 
can be compared to 1/3. The mean simulated probabilities 
of selection over the 10,000 trials are shown in Figure 2 as 
a function of the number of frames in which the unit is 
contained within a probability stratum. There were 19 
commodities of interest in this state, but no units existed in 
probability strata in exactly 16 or 19 frames. A unit’s 
probability of selection tends to increase with the number
of probability strata containing it. This selection probability
is 1/3 only when the unit is in exactly one such stratum. 




[Figure 1 panels not reproduced. Each panel is titled “Realized & Desired Sample Size Comparison,
CDFs Over 100 Replications”, with horizontal axis “Difference as Percent of Desired”.]

Figure 1. Comparison of realized and desired sample sizes for sampled strata. Top - MI;
middle - CA; bottom - NJ.

[Figure 2 plot not reproduced. Plot title: “Simulated Probabilities of Selection For Fixed Sample Size Method”.]

Figure 2. Simulated probabilities of selection for the fixed-sample-size
method, California.


6. CALIBRATION 


The problem with both $t_M$ and $t_P$ (or $t_{HT}$) is that they are
often not very good estimators for $T$ in terms of precision
(variance). One of the properties of single-frame, stratified
simple random sampling is that the conventional expansion
estimator estimates the stratum population size perfectly
(i.e., with zero variance). In our multiple frame set up,
however, neither $t_M$ nor $t_P$ will estimate the $N_{fh}$ perfectly
in most applications.

Let us define $w_i^0 = n_{(i)}/E[n_{(i)}]$ as the original sampling
weight of unit $i$ in $t_M$. Similarly, $w_i^0 = 1/\max_f \{p_{if}\}$ in $t_P$,
and $1/\pi_i$ more generally for a Horvitz-Thompson estimator.
Bankier (1986) proposed raking to create a set of adjusted
weights such that

$$\sum_{i \in S_{fh}} w_i^C = N_{fh} \qquad (2)$$

for each stratum $h$ in every frame $f$, where $S_{fh}$ is that part of
the sample that is in stratum $h$ of frame $f$ regardless of the
frame(s) from which the units were selected.

Deville and Särndal (1992) call (2) a calibration equation.
They point out that there are a number of ways to
compute the calibration weights, the $w_i^C$, so that equation (2)
is satisfied and $w_i^C/w_i^0$ is in some sense close to 1
for all $i$. One method is raking as suggested by Bankier
(1986). Another method, discussed at length by Deville and
Särndal (1992), uses least squares. Either way, the resulting
estimator

$$t_C = \sum_{i \in S} w_i^C y_i,$$

where $S$ denotes the entire sample, will be nearly design
unbiased because $w_i^C/w_i^0$ is close to 1 for all $i$.
The estimator $t_C$ is also unbiased under the model:


$$y_i = \beta_0 + \sum_{f=1}^{F} \sum_{h=2}^{H_f} d_{ifh} \beta_{fh} + \epsilon_i, \qquad (3)$$

where the dummy variable, $d_{ifh}$, is 1 when unit $i$ is in
stratum $h$ of frame $f$ (sampled or not) and zero otherwise,
while $\epsilon_i$ is a random variable with a mean of zero. The $\beta_0$
and the $\beta_{fh}$ are unknown constants ($\beta_0$ represents the mean
$y$-value for a unit in the first stratum of every frame; that is
why the second sum excludes $h = 1$). The same $d_{ifh}$ values
apply to every survey item ($y$) of interest, while the $\beta$
values change with the survey item. For many survey items,
$\beta_{fh}$ values will be zero when frame $f$ (say, grain stocks) is
irrelevant to the item (say, planted oat acres).
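
To make the calibration step concrete, the following Python sketch rakes a set of initial weights so that, within every frame/stratum, the weights of the sampled units sum to the known stratum size. It is a generic iterative proportional fitting loop written for illustration, not necessarily the algorithm of Bankier (1986); the dictionary layout and the name `rake_weights` are assumptions, and every frame/stratum is assumed to contain at least one sampled unit.

```python
def rake_weights(w0, groups, totals, max_iter=100, tol=1e-10):
    """Scale weights repeatedly until each frame/stratum sums to its known size.

    w0     : initial sampling weights of the sampled units
    groups : dict mapping a frame/stratum key to the sample indices it contains
    totals : dict mapping the same keys to the known stratum population sizes
    """
    w = list(w0)
    for _ in range(max_iter):
        worst = 0.0
        for key, members in groups.items():
            current = sum(w[i] for i in members)
            factor = totals[key] / current
            for i in members:
                w[i] *= factor
            worst = max(worst, abs(factor - 1.0))
        if worst < tol:  # every adjustment in this pass was negligible
            break
    return w


if __name__ == "__main__":
    # Four sampled units cross-classified by two frames with two strata each.
    groups = {("A", 1): [0, 1], ("A", 2): [2, 3],
              ("B", 1): [0, 2], ("B", 2): [1, 3]}
    totals = {("A", 1): 10.0, ("A", 2): 20.0,
              ("B", 1): 12.0, ("B", 2): 18.0}
    print(rake_weights([3.0, 3.0, 3.0, 3.0], groups, totals))
```

The stratum sizes of the different frames must be mutually consistent (here both frames sum to 30) for the loop to settle on weights that satisfy all the constraints at once.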

Isaki and Fuller (1982) call the model expectation of the
design mean squared error of $t_C$ the “anticipated mean
squared error” of the estimator. This value is of most use at
the planning stage of a sample survey.

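If the model in equation (3) holds, and the $\epsilon_i$ are uncorrelated,
then the anticipated mean squared error of $t_C$ is

$$
\begin{aligned}
E_\epsilon[MSE_p(t_C)] &= E_\epsilon\{E_p[(\sum_{i \in S} w_i^C y_i - \sum_{i \in P} y_i)^2]\} \\
&= E_p\{E_\epsilon[(\sum_{i \in S} w_i^C y_i - \sum_{i \in P} y_i)^2]\} \\
&= E_p\{E_\epsilon[(\sum_{i \in S} w_i^C \epsilon_i - \sum_{i \in P} \epsilon_i)^2]\} \\
&= E_p\{\sum_{i \in S} [(w_i^C)^2 - 2 w_i^C] E_\epsilon(\epsilon_i^2)\} + \sum_{i \in P} E_\epsilon(\epsilon_i^2) \\
&\approx E_p\{\sum_{i \in S} [(1/\pi_i^2) - (2/\pi_i)] E_\epsilon(\epsilon_i^2)\} + \sum_{i \in P} E_\epsilon(\epsilon_i^2) \\
&= \sum_{i \in P} (1/\pi_i - 1) E_\epsilon(\epsilon_i^2), \qquad (4)
\end{aligned}
$$

since $w_i^C \approx w_i^0 = 1/\pi_i$. It is of some interest to note that using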
Poisson, collocated, and systematic PPS sampling result in 
estimators with approximately equal anticipated mean 
squared errors asymptotically. This surprising result is in 
part due to the nature of a calibrated estimator, but it is also 
a repercussion of the fact that when we take the design 
expectation of the approximate model variance in the last 
line of equation (4), we average over all possible samples 
and remove the biggest source of variation among the three 
sampling designs. 

Now suppose we had used stratified simple random
sampling and selected unit $i$ with probability $p_{if} \le \pi_i$,
where $f$ is the frame relevant to $y$. It is not hard to show
that the anticipated variance of the simple expansion
estimator would have been $\sum_{i \in P} (1/p_{if} - 1) E_\epsilon(\epsilon_i^2)$, which is
at least as large as the right hand side of equation (4). Thus,
there are gains, in large samples at least, from
“integrating” the samples from various frames as we have
effectively done. How large the samples must be in practice
for the asymptotic results to be relevant is unclear. At the
very least, the sample size must be many times the number
of model parameters in equation (3).
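
The comparison can be checked numerically with a small Python sketch. The probabilities and model variances below are invented for illustration only, and `anticipated_mse` is a hypothetical helper that evaluates the sum of (1/probability - 1) times the model variance over the population.

```python
def anticipated_mse(probs, model_var):
    """Sum over the population of (1/prob - 1) * model variance of the error term."""
    return sum((1.0 / p - 1.0) * v for p, v in zip(probs, model_var))


if __name__ == "__main__":
    model_var = [4.0, 1.0, 9.0]        # postulated error variances (made up)
    single_frame = [0.25, 0.5, 1.0]    # rates in the frame relevant to the item
    integrated = [0.5, 0.5, 1.0]       # largest rate over all frames
    print(anticipated_mse(single_frame, model_var))
    print(anticipated_mse(integrated, model_var))
```

Since the integrated probability of a unit is never smaller than its probability in any single frame, each term of the second sum is no larger than the corresponding term of the first.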

A few words on mean squared error estimation for $t_C$ are
in order. The mean squared error estimator advocated by
Deville and Särndal (1992), an estimator with both good
design and model-based properties, cannot be implemented
unless the joint selection probability ($\pi_{ij}$) for every pair of
sample units ($i$ and $j$) is known. Among the designs we
have discussed, these probabilities are easily calculated
only for the Poisson variant of PRN (where $\pi_{ij} = \pi_i \pi_j$).

As we have observed in equation (4), the anticipated
mean squared error of the calibration estimator is approximately
the same under Poisson PRN, collocated PRN, and systematic PPS
sampling in large samples. This suggests that the Poisson mean squared
error estimator may be reasonable under each of the three 
designs. A stronger model-driven argument exists for this 
contention, but will not be made here. 


7. DISCUSSION 


In the last section, it was pointed out that if calibration 
weights were designed to satisfy equation (2), the resulting 
estimator would be unbiased under the model in equa- 
tion (3). In many applications, there may be a more 
appropriate model on which to base calibration than the one 
in equation (3). For example, if there was a continuous 
control variable used to stratify a particular frame, it makes 
more sense to use that variable directly in the model rather 
than indirectly through frame/stratum identifiers. 

Raking is a form of calibration under a particular model. 
With that in mind, it makes sense to use the most reasonable 
model available. Least squares has the advantage over 
raking that it can easily be applied to continuous control 
variables. Singh and Mohl (1996) provide an extensive 
review of alternative calibration algorithms including an 
extension of raking to continuous variables. An intriguing 
least-squares variant missed by Singh and Mohl (1996) can 
be found in Brewer (1994). 

Many economic and agricultural surveys employ rotating 
sample designs. This has proved an effective way to 
balance cost and burden considerations. Although our 
empirical findings demonstrated an advantage of the sys- 
tematic PPS methodology in terms of meeting target sample 
sizes, the three PRN designs are much more conducive to 
sample rotation. See, for example, Ohlsson (1995) on this 
topic. Moreover, with the PRN methods, one can integrate 
different frames at different times of the year (with systema- 
tic PPS there is no easy way to allocate the sample back to 
the frame of origin). This is a particularly useful property 
for agricultural surveys because different crops have 
different growing seasons. 

In summary, the fixed-sample-size PRN sample design 
is excellent for meeting target sample sizes but is hard to 
use in practice because selection probabilities are usually 
unknown and must be simulated. The systematic PPS 
design is very good at meeting target sample sizes but is 
difficult to incorporate into a sample rotation scheme. 
Moreover, mean squared error estimation requires invoca- 
tion of model assumptions. Our empirical example shows 
that collocated sampling may only be slightly better than 
Poisson at meeting target sample sizes. It should be recognized,
however, that other configurations of the frames,
strata, and sampling fractions may produce different results.
Moreover, collocated sampling is conducive to rotation 
schemes, like Poisson sampling. On the other hand, like 
PPS sampling, it requires the assumption of a model to 
estimate mean squared error. 

Finally, setting $p_{if}$ or $n_{fh}$ targets is a popular, but indirect,
means of controlling the variance of the estimator $t_C$
associated with each frame. These targets lead to our ad hoc
decision to set $\pi_i$ equal to $\max_f \{p_{if}\}$. A more direct
strategy would be to set (asymptotic) anticipated variance
targets for each frame estimator using equation (4) and
postulated values for the $E_\epsilon(\epsilon_i^2)$. One could then choose,
say, the set of $\pi_i$ that minimizes the expected sample size
yet satisfies these variance targets. A similar approach is
taken by Amrhein, Fleming, and Bailey (1997) who use
Chromy’s algorithm in a manner analogous to Sigman and
Monsour (1995). Poisson PRN, collocated PRN, and
systematic PPS sampling remain three viable alternatives
for selecting the sample once optimal $\pi_i$ are determined.


Use of Auxiliary Information for Two-phase Sampling


M.A. HIDIROGLOU and C.-E. SÄRNDAL¹


ABSTRACT 


Two-phase sampling designs offer a variety of possibilities for use of auxiliary information. We begin by reviewing the 
different forms that auxiliary information may take in two-phase surveys. We then set up the procedure by which this 
information is transformed into calibrated weights, which we use to construct efficient estimators of a population total. The 
calibration is done in two steps: (i) at the population level; (ii) at the level of the first-phase sample. We go on to show that 
the resulting calibration estimators are also derivable via regression fitting in two steps. We examine these estimators for 
a special case of interest, namely, when auxiliary information is available for population subgroups called calibration 
groups. Poststrata are the simplest example of such groups. Estimation for domains of interest and variance estimation are 
also discussed. These results are illustrated by applying them to two important two-phase designs at Statistics Canada. The 
general theory for using auxiliary information in two-phase sampling is being incorporated into Statistics Canada’s 
Generalized Estimation System. 


KEY WORDS: Generalized regression; Two-phase sampling; Model assisted approach; Domain estimation; Calibration factors.


1. INTRODUCTION 


Two-phase sampling is a powerful and cost-effective 
technique. It was first proposed by Neyman (1938). In 
Cochran’s (1977) book, and in its two earlier editions dated 
1953 and 1963, one finds basic results for two-phase 
sampling, including the simplest regression estimators for 
such designs. This paper takes a broader outlook and 
proposes a general approach to the use of auxiliary 
information in two-phase survey designs. Our main 
references are Särndal and Swensson (1987), Särndal,
Swensson and Wretman (1992) and Dupont (1995). Recent 
related work includes Breidt and Fuller (1993), who 
presented computationally efficient estimation procedures 
for three-phase sampling in the presence of auxiliary 
information. Chaudhuri and Roy (1994) studied optimality 
properties of the well-known simpler regression estimators 
for two-phase sampling. Binder (1996) described a simple 
linearization procedure to estimate variances of nonlinear 
estimators. His procedure can be applied to any sampling 
design, including two-phase sampling. Throughout this
paper, we assume arbitrary sampling designs for each of 
the two phases. 

Single-phase sampling involves the use of one layer of 
information for estimation. In two-phase sampling, how- 
ever, one has to consider two layers of information. This 
complicates matters, and it is not clear-cut how best to 
exploit the combined information from the two sources. 
Two approaches are considered in this paper for building 
estimators based on auxiliary information. These are the 
calibration approach and the generalized regression 
approach. We show that the generalized regression 
approach can be viewed as a special case of the calibration
approach. The two approaches are examined under a
common structure for the auxiliary information. It assumes
that information exists about an auxiliary vector $\mathbf{x}_1$ for the
units of the entire population, and about a second auxiliary
vector $\mathbf{x}_2$ for the units of the first-phase sample.
Consequently, at the level of the first-phase sample, there is
information about both vectors, $\mathbf{x}_1$ and $\mathbf{x}_2$.

The generalized regression approach, as applied to two-phase
sampling, is discussed in Särndal et al. (1992). These
authors develop the general regression estimator for two- 
phase sampling, assuming arbitrary sampling designs in 
each of the two phases. Two regression fits are carried out. 
A “bottom level” regression is fitted to produce predicted 
values up to the level of the first phase sample, using the 
auxiliary information available for this step. Next, a “top 
level” regression is fitted to produce predicted values up to 
the entire population level, using the information 
appropriate for this step. The two sets of predicted values 
are used to build a generalized regression estimator. 

The calibration approach focuses on the weights given 
to the units for purposes of estimation. Calibration implies 
that a set of starting weights (usually the sampling design 
weights) are transformed into a set of new weights, called 
calibrated weights. The calibrated weight of a unit is the 
product of its initial weight and a calibration factor. The 
calibration factors are obtained by minimizing a function 
measuring the distance between the initial weights and the 
calibrated weights, subject to the constraint that the cali- 
brated weights yield exact estimates of the known auxiliary 
population totals. In two-phase sampling the two levels of 
information imply two consecutive calibrations. The first 
phase of calibration uses the auxiliary information available 
(at least population counts) at the level of the entire
population, resulting in first-phase calibrated weights. The
second phase of calibration uses these first-phase calibrated 
weights and incorporates the information at the level of the 
first-phase sample, resulting in a final set of calibrated 
weights. 

Both approaches profit from the two layers of infor- 
mation. They do not necessarily yield identical results. 
Whether they do or not depends on the exact formulations 
given to the regression fits and the calibration approach. 
This is apparent in Dupont (1995), where four alternative 
estimators were developed under the regression approach. 
These differ in the way that the auxiliary variables are used 
in deriving the predicted y-values required for the 
regression estimator. For each of these four approaches, 
Dupont built a matching estimator using the calibration 
approach. She succeeded in obtaining an exact equivalence 
between the two approaches in only one of the four cases. 
Three of Dupont’s four approaches can be considered as 
special cases of the general approach in this paper. 

In this paper, building on Hidiroglou and Särndal (1995),
we provide a unified theory for two-phase sampling with 
auxiliary information. We show that the regression estima- 
tors can be obtained as a special case of the calibration 
approach. Direct linkage between the two approaches is 
therefore possible. One motivation for our work was the 
necessity to provide tools for efficient use of administrative 
data sources in several important Statistics Canada surveys. 
Our work has prepared the way for the inclusion of two- 
phase sampling into Statistics Canada’s Generalized 
Estimation System described in Estevao, Hidiroglou and 
Särndal (1995).

We illustrate our general theory by applying it to two 
survey designs currently used at Statistics Canada. The first 
application, Armstrong and St-Jean (1994), describes the 
use of the two-phase approach for sampling tax records. 
Our second application, Hidiroglou, Latouche, Armstrong 
and Gossen (1995), involves the use of two-phase sampling 
of payroll deduction accounts used in Statistics Canada’s 
Survey of Earnings, Payrolls and Hours. 

The paper is organized as follows. Section 2 sets up the 
notation. Section 3 specifies our version of the calibration 
approach in two-phase sampling. Section 4 establishes the 
important result that the resulting calibration estimator can 
be expressed, with exact equivalence, as a two-phase 
regression estimator, that is, one derived via two consecu- 
tive regression fits. Additional theoretical results are 
reported in Sections 5 and 6. Section 5 examines the forms 
taken by our two-phase calibration estimator under impor- 
tant special types of information, namely, when some of the 
auxiliary variables, either in the first or in the second phase, 
correspond to categorical variables that codify a grouping 
of the units into mutually exclusive and exhaustive classes. 
Section 6 gives results on two issues that always require 
attention in a survey and that are central to the Generalized
Estimation System (GES), namely,
(a) estimation for domains (sub-populations), and (b)
design-based variance estimation. For variance estimation
we use the approach of Särndal and Swensson (1987).
Section 7 shows how the preceding theory is applied to 
two-phase designs currently in use at Statistics Canada. 
Finally, Section 8 provides a brief summary. 


2. NOTATION 


The population is represented by $U = \{1, \ldots, k, \ldots, N\}$.
A first-phase probability sample $s_1$ ($s_1 \subseteq U$) is drawn from
the population $U$, according to a sampling design with the
selection probabilities $\pi_{1k} = P(k \in s_1)$. Given $s_1$, a
second-phase sample $s_2$ ($s_2 \subseteq s_1 \subseteq U$) is selected from $s_1$,
according to a sampling design with the selection probabilities
$\pi_{2k} = P(k \in s_2 \mid s_1)$. Note that these are conditional
probabilities, given $s_1$. We assume that $\pi_{1k} > 0$ for all $k \in U$
and $\pi_{2k} > 0$ for all $k \in s_1$. From this point on, we work with
weights in the estimation process. We will denote the first-phase
sampling weight of unit $k$ as $w_{1k} = 1/\pi_{1k}$, and the
second-phase sampling weight as $w_{2k} = 1/\pi_{2k}$. The overall
sampling weight for a selected unit is $w_k^* = w_{1k} w_{2k}$.

Our objective is to estimate the population total $Y = \sum_U y_k$,
where $y_k$ is the value of the variable of interest $y$
for unit $k$. If $A \subseteq U$ is an arbitrary set of units, we write
simply $\sum_A$ for $\sum_{k \in A}$. The customary two-phase sampling
procedure calls for collecting inexpensive information
about the units $k$ belonging to a large first-phase sample $s_1$.
This information is then used to realize efficient sampling
and estimation in the second phase. The values $y_k$ are
recorded for $k \in s_2$. An unbiased estimator of $Y$ is given by
$\hat{Y} = \sum_{s_2} w_k^* y_k$. This estimator uses sampling weights only.
A more extensive use of available auxiliary information is
achieved through the regression estimators that we will now
examine.
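
As a minimal sketch of the weights just defined, the following Python function forms $w_k^* = 1/(\pi_{1k}\pi_{2k})$ for each second-phase unit and sums $w_k^* y_k$. The function name `expansion_estimate` and the toy numbers are illustrative assumptions, not part of the paper.

```python
def expansion_estimate(pi1, pi2, y):
    """Return the sum over s2 of w*_k y_k, where w*_k = w_1k * w_2k.

    pi1[k] is the first-phase inclusion probability of unit k and
    pi2[k] its conditional second-phase probability given s1.
    """
    total = 0.0
    for p1, p2, yk in zip(pi1, pi2, y):
        if not (0 < p1 <= 1 and 0 < p2 <= 1):
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        w_star = (1.0 / p1) * (1.0 / p2)  # overall sampling weight w*_k
        total += w_star * yk
    return total


# Toy second-phase sample of three units (made-up values).
estimate = expansion_estimate([0.5, 0.5, 0.25], [0.5, 1.0, 0.5],
                              [10.0, 20.0, 30.0])
```

Only the sampling weights enter here; no auxiliary information is used yet.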

We denote the auxiliary vector at the level of the first-phase
sample as $\mathbf{x}$ and its value for unit $k$ as $\mathbf{x}_k$. As in
Särndal et al. (1992, chapter 9), we partition $\mathbf{x}_k$ as
$\mathbf{x}_k = (\mathbf{x}'_{1k}, \mathbf{x}'_{2k})'$. Information is available up to the entire
population level for the vector $\mathbf{x}_{1k}$, whereas for the vector $\mathbf{x}_{2k}$,
information is only available up to the level of the first-phase
sample. Table 1 summarizes our assumptions on the
auxiliary information available for estimation.


Table 1
Relationship between set of units and available data
at different levels

Set of units            Data available
Population              $\{\mathbf{x}_{1k} : k \in U\}$ or $\sum_U \mathbf{x}_{1k}$
First-phase sample      $\{\mathbf{x}_k : k \in s_1\}$
Second-phase sample     $\{(\mathbf{x}_k, y_k) : k \in s_2\}$


Note that individual values $\mathbf{x}_{1k}$, $k \in U$, are not required. It
suffices to know the total $\sum_U \mathbf{x}_{1k}$, which may be taken from
a reliable administrative source. The presence of auxiliary
information in one or both phases opens the possibility of
modifying the sampling weights with the aid of calibration
factors calculated using the auxiliary information. In each
of the two phases, a unit’s sampling weight is modified by
multiplying it by the calibration factor, resulting in a
calibrated weight.

The first-phase calibrated weight $\tilde{w}_{1k}$ is computed for
units $k \in s_1$ as $\tilde{w}_{1k} = w_{1k} g_{1k}$. The first-phase sampling
weight is $w_{1k}$, and the first-phase calibration factor is $g_{1k}$.
Similarly, we compute overall calibrated weights
$\tilde{w}_k^* = w_k^* g_k^*$ for units $k \in s_2$, where $g_k^*$ is the overall calibration
factor. The superscript “*” denotes overall weights taking
into account both phases. The superimposed symbol “~”
denotes calibrated weights.


3. CALIBRATION WITH GENERALIZED 
LEAST SQUARES DISTANCE 


Auxiliary information available at each phase of sam- 
pling can improve weights by the process known as 
calibration. This improvement yields smaller variances of 
the resulting estimates if there is a strong correlation 
between the auxiliary variables and the variables of interest. 
We seek a set of “new” weights that lie as close as possible 
to a set of starting weights. The calibration requires the 
specification of a measure of the distance between the 
starting weights and the new weights. Several distance 
functions have been proposed; see Deville and Särndal
(1992), Deville, Särndal, and Sautory (1993), and Singh
and Mohl (1996). Any one of these distance functions could 
be used for two-phase calibration. However, we concentrate 
on one of these, namely, the generalized least squares
(GLS) distance. For an arbitrary set of units $s$, it is of the form

$$D = \frac{1}{2} \sum_{s} \frac{C_k (\tilde{w}_k - w_k)^2}{w_k} \tag{3.1}$$

where $\{w_k : k \in s\}$ are the starting weights, $\{\tilde{w}_k : k \in s\}$ are
the new calibrated weights, and $\{C_k : k \in s\}$ are specified
positive factors that control the relative importance of the
terms of this sum. For each of the two phases, we minimize
a GLS distance measure with suitable factors $C_k$, subject to
constraints. After applying the two successive calibrations,
we have a set of overall calibrated weights.


(i) First-phase calibration (from $s_1$ to $U$).


The first-phase sampling weights $\{w_{1k} : k \in s_1\}$ are used
as starting weights. Let $\{C_{1k} : k \in s_1\}$ be pre-specified
positive factors. We determine the first-phase calibrated
weights by minimizing the GLS distance

$$D_1 = \frac{1}{2} \sum_{s_1} \frac{C_{1k} (\tilde{w}_{1k} - w_{1k})^2}{w_{1k}} \tag{3.2}$$

subject to the first-phase calibration equation

$$\sum_{s_1} \tilde{w}_{1k} \mathbf{x}_{1k} = \sum_U \mathbf{x}_{1k} \tag{3.3}$$

where the total $\sum_U \mathbf{x}_{1k}$ is known. Note that this calibration
does not involve information concerning $\mathbf{x}_{2k}$, because it is
available only up to $s_1$.

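The resulting weights are

$$\tilde{w}_{1k} = w_{1k} g_{1k} \tag{3.4}$$

with

$$g_{1k} = 1 + \Big( \sum_U \mathbf{x}_{1k} - \sum_{s_1} w_{1k} \mathbf{x}_{1k} \Big)' \mathbf{T}_1^{-1} \frac{\mathbf{x}_{1k}}{C_{1k}} \tag{3.5}$$

and

$$\mathbf{T}_1 = \sum_{s_1} \frac{w_{1k} \mathbf{x}_{1k} \mathbf{x}'_{1k}}{C_{1k}}. \tag{3.6}$$

Some of the $\tilde{w}_{1k}$ given by (3.4) may be negative, or zero.
Many users prefer weights to be always positive. This can
be achieved by adding to (3.3) the inequality constraints
$\tilde{w}_{1k} > 0$ for all $k \in s_1$. The resulting weights have no closed
expression, in contrast to (3.4).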
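
The closed form (3.4) to (3.6) can be sketched for the simplest case of a scalar auxiliary variable. The function name `first_phase_factors` and the data are assumptions made for illustration. The second call uses the choice $C_{1k} = x_{1k}$, for which (3.5) reduces to the constant ratio factor $\sum_U x_{1k} / \sum_{s_1} w_{1k} x_{1k}$.

```python
def first_phase_factors(w1, x1, c1, x1_total):
    """Calibration factors g_1k of (3.5) for a scalar auxiliary x_1k."""
    x_hat = sum(w * x for w, x in zip(w1, x1))               # sum_{s1} w_1k x_1k
    t1 = sum(w * x * x / c for w, x, c in zip(w1, x1, c1))   # T_1 of (3.6)
    if t1 == 0:
        raise ValueError("T_1 is zero; the factors are undefined")
    return [1.0 + (x1_total - x_hat) * x / (t1 * c) for x, c in zip(x1, c1)]


# Toy first-phase sample (made-up values); the population total of x_1 is 250.
w1 = [10.0, 10.0, 20.0, 20.0]
x1 = [2.0, 4.0, 3.0, 5.0]
x1_total = 250.0

g_unit = first_phase_factors(w1, x1, [1.0] * 4, x1_total)   # C_1k = 1
g_ratio = first_phase_factors(w1, x1, x1, x1_total)         # C_1k = x_1k

# Both sets of calibrated weights w_1k * g_1k reproduce the known total (3.3).
total_unit = sum(w * g * x for w, g, x in zip(w1, g_unit, x1))
total_ratio = sum(w * g * x for w, g, x in zip(w1, g_ratio, x1))
```

This sketch does not enforce positivity of the calibrated weights; as noted above, doing so requires a constrained minimization with no closed form.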


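(ii) Second-phase calibration (from $s_2$ to $s_1$).


We use $\{\tilde{w}_{1k} w_{2k} : k \in s_2\}$ as starting weights, where $\tilde{w}_{1k}$
is given by (3.4). These weights incorporate the information
about $\mathbf{x}_{1k}$ available up to the full population level.
Applying them to the data $\{y_k : k \in s_2\}$ yields one possible
estimator, namely $\hat{Y} = \sum_{s_2} \tilde{w}_{1k} w_{2k} y_k$. However, since these
weights do not contain the $\mathbf{x}_{2k}$-value information available
for $k \in s_1$, they can be improved through a second-phase
calibration. Let $\{C_{2k} : k \in s_2\}$ be specified positive factors.
We determine the overall calibrated weights $\tilde{w}_k^*$ by
minimizing

$$D_2 = \frac{1}{2} \sum_{s_2} \frac{C_{2k} (\tilde{w}_k^* - \tilde{w}_{1k} w_{2k})^2}{\tilde{w}_{1k} w_{2k}} \tag{3.7}$$

subject to the second-phase calibration equation

$$\sum_{s_2} \tilde{w}_k^* \mathbf{x}_k = \sum_{s_1} \tilde{w}_{1k} \mathbf{x}_k \tag{3.8}$$

where $\mathbf{x}_k = (\mathbf{x}'_{1k}, \mathbf{x}'_{2k})'$. The resulting overall calibrated
weights are

$$\tilde{w}_k^* = w_k^* g_k^* \tag{3.9}$$

where

$$g_k^* = g_{1k} g_{2k} \tag{3.10}$$

with $g_{1k}$ given by (3.5) and $g_{2k}$ by

$$g_{2k} = 1 + \Big( \sum_{s_1} \tilde{w}_{1k} \mathbf{x}_k - \sum_{s_2} \tilde{w}_{1k} w_{2k} \mathbf{x}_k \Big)' \mathbf{T}_2^{-1} \frac{\mathbf{x}_k}{C_{2k}} \tag{3.11}$$

for $k \in s_2$, and

$$\mathbf{T}_2 = \sum_{s_2} \frac{\tilde{w}_{1k} w_{2k} \mathbf{x}_k \mathbf{x}'_k}{C_{2k}}. \tag{3.12}$$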

Again, some $g_k^*$ may be zero or negative, but always
positive $g_k^*$ can be ensured by adding to (3.8) the
inequality constraints $\tilde{w}_k^* > 0$ for $k \in s_2$.

Having determined the overall weights $\tilde{w}_k^*$ by equation
(3.9), the estimator of $Y$ is given by

$$\hat{Y} = \sum_{s_2} \tilde{w}_k^* y_k. \tag{3.13}$$

Remark 3.1: A potential problem with the above approach
is that some of the $g_{1k}$’s may be negative or even zero. If
this occurs, (3.7) is not a proper distance measure. Some of
the important applications, such as poststratification, do not
have this problem as their associated $g_{1k}$’s are always
greater than zero. If all the $g_{1k}$’s are greater than zero, then
the minimization criterion given by (3.7) is acceptable.
Otherwise, we have to modify it. One possible modification
is to impose the above-mentioned constraints that the
$\tilde{w}_{1k}$’s are positive for $k \in s_1$. Another possible modification
is to replace $C_{2k}$ in (3.7) by

$$C_{2k} \, \tilde{w}_{1k} / w_{1k}.$$

Then the distance (3.7) becomes

$$D_2 = \frac{1}{2} \sum_{s_2} \frac{C_{2k} (\tilde{w}_k^* - \tilde{w}_{1k} w_{2k})^2}{w_k^*},$$

in which the denominator $w_k^*$ is always positive. The resulting $g_k^*$-factors in (3.9)
can be shown to be $g_k^* = g_{1k} + g_{2k} - 1$, where $g_{1k}$ is given
as before by (3.5), and $g_{2k}$ by (3.11) provided that we
instead define $\mathbf{T}_2$ as

$$\mathbf{T}_2 = \sum_{s_2} \frac{w_k^* \mathbf{x}_k \mathbf{x}'_k}{C_{2k}}.$$

It is our opinion that in most applications the choice
between the multiplicative form $g_k^* = g_{1k} g_{2k}$ and the additive
form $g_k^* = g_{1k} + g_{2k} - 1$ would have little effect on the
resulting estimates. That is, we believe the two point
estimates would be very close, and so would be their
associated estimates of variance.

Remark 3.2: Bounding the weights ordinarily has negligible
impact on the estimates. Recent experience with
calibration for single-phase designs, Stukel, Hidiroglou, and
Särndal (1996), has shown that mildly different sets of
g-weights lead to point estimates that differ very little.
Some recently developed computer software for calibration,
for example, the software described in Deville et al. (1993),
minimizes a distance function such that the resulting
g-factors are guaranteed to be bounded from above and
from below.


Remark 3.3: The auxiliary data in Table 1 can be used in
several ways for two-phase calibration. Considering in
particular the second-phase calibration equation defined by
(3.8), three different specifications of the vector $\mathbf{x}_k$ are: (i)
$\mathbf{x}_k = (\mathbf{x}'_{1k}, \mathbf{x}'_{2k})'$, (ii) $\mathbf{x}_k = \mathbf{x}_{2k}$, and (iii) $\mathbf{x}_k = \mathbf{x}_{1k}$. We
comment on these possibilities, assuming for each of these
that a first-phase calibration has been carried out, resulting
in the first-phase calibrated weights (3.4).

The case (i) specification $\mathbf{x}_k = (\mathbf{x}'_{1k}, \mathbf{x}'_{2k})'$, recommended
in Särndal et al. (1992), capitalizes on all the available
information. Thus, in this respect case (i) is ideal. Cases (ii)
and (iii) disregard some available information. Case (ii) is
sometimes of interest, despite some loss of information; an
example is given in Section 7.1. Case (iii) implies that the
data $\{\mathbf{x}_{2k} : k \in s_1\}$ are observed, but not used; we do not
further consider this case. We call $\mathbf{x}_k = (\mathbf{x}'_{1k}, \mathbf{x}'_{2k})'$ the full
vector and $\mathbf{x}_k = \mathbf{x}_{2k}$ the reduced vector.

Second-phase calibration on the reduced vector $\mathbf{x}_k = \mathbf{x}_{2k}$
can be carried out without significant loss of information if $\mathbf{x}_{2k}$
is a good substitute for $\mathbf{x}_{1k}$, as also observed by Dupont
(1995). However, if $\mathbf{x}_{2k}$ complements $\mathbf{x}_{1k}$, then the full
vector $\mathbf{x}_k = (\mathbf{x}'_{1k}, \mathbf{x}'_{2k})'$ should clearly be used in the
calibration defined by (3.7). Otherwise, significant loss of
information and increased variance may result.



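Remark 3.4: Both the full and the reduced $\mathbf{x}_k$-vectors lead
to overall weights $\tilde{w}_k^*$ calibrated on $\mathbf{x}_{2k}$ from $s_2$ to $s_1$. This
means that $\sum_{s_2} \tilde{w}_k^* \mathbf{x}_{2k} = \sum_{s_1} \tilde{w}_{1k} \mathbf{x}_{2k}$, because (3.8) holds,
and $\mathbf{x}_{2k}$ is contained in $\mathbf{x}_k$. However, there exists a
difference between the full and reduced vector specifications
with respect to the calibration on $\mathbf{x}_{1k}$. If the full vector
specification is used in phase two, the resulting overall
weights $\tilde{w}_k^*$ are calibrated on $\mathbf{x}_{1k}$ from $s_2$ to $s_1$, and from $s_1$
to $U$. This means that $\sum_{s_2} \tilde{w}_k^* \mathbf{x}_{1k} = \sum_{s_1} \tilde{w}_{1k} \mathbf{x}_{1k} = \sum_U \mathbf{x}_{1k}$. In
contrast, if the reduced vector specification is used, the
resulting overall weights $\tilde{w}_k^*$ are calibrated on $\mathbf{x}_{1k}$ from $s_1$
to $U$ by virtue of the first-phase calibration. That is,
$\sum_{s_1} \tilde{w}_{1k} \mathbf{x}_{1k} = \sum_U \mathbf{x}_{1k}$. However, they are not calibrated
from $s_2$ to $s_1$, because $\mathbf{x}_{1k}$ is not present in the second-phase
calibration. Hence, $\sum_{s_2} \tilde{w}_k^* \mathbf{x}_{1k} \neq \sum_{s_1} \tilde{w}_{1k} \mathbf{x}_{1k} = \sum_U \mathbf{x}_{1k}$.
Thus if the survey requires a weight system that
will reproduce the known $\sum_U \mathbf{x}_{1k}$, then the full vector
specification must be used.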
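
The difference between the full and the reduced specification can be seen on a small example. The sketch below is illustrative only: the helper names (`solve`, `g_factors`, `overall_weights`), the choice $C_{1k} = C_{2k} = 1$, and the data are all assumptions. It calibrates first on a scalar $x_{1k}$, then on either $(x_{1k}, x_{2k})'$ or $x_{2k}$ alone, and compares the resulting weighted totals of $x_{1k}$ over $s_2$.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def g_factors(w, X, target):
    """GLS factors 1 + (target - sum w x)' T^{-1} x_k, with all C_k = 1."""
    p = len(target)
    est = [sum(wk * x[a] for wk, x in zip(w, X)) for a in range(p)]
    T = [[sum(wk * x[a] * x[b] for wk, x in zip(w, X)) for b in range(p)]
         for a in range(p)]
    lam = solve(T, [t - e for t, e in zip(target, est)])
    return [1.0 + sum(l * v for l, v in zip(lam, x)) for x in X]


# Toy first-phase sample of six units; units 0, 1, 3, 4 form s2 (made-up data).
w1 = [10.0, 10.0, 20.0, 20.0, 15.0, 15.0]
x1 = [2.0, 4.0, 3.0, 5.0, 1.0, 6.0]
x2 = [1.0, 3.0, 2.0, 2.0, 4.0, 5.0]
x1_total = 330.0                      # assumed known population total of x_1
s2 = [0, 1, 3, 4]
w2 = [2.0, 2.0, 1.5, 1.5]

g1 = g_factors(w1, [[v] for v in x1], [x1_total])   # first-phase calibration
w1_cal = [w * g for w, g in zip(w1, g1)]


def overall_weights(spec):
    """Second-phase calibration; spec(k) returns the vector x_k of unit k."""
    X_s1 = [spec(k) for k in range(len(w1))]
    target = [sum(w * x[a] for w, x in zip(w1_cal, X_s1))
              for a in range(len(X_s1[0]))]
    start = [w1_cal[k] * v for k, v in zip(s2, w2)]
    g2 = g_factors(start, [X_s1[k] for k in s2], target)
    return [s * g for s, g in zip(start, g2)]


full = overall_weights(lambda k: [x1[k], x2[k]])
reduced = overall_weights(lambda k: [x2[k]])
x1_full = sum(w * x1[k] for w, k in zip(full, s2))
x1_reduced = sum(w * x1[k] for w, k in zip(reduced, s2))
```

With these data the full specification reproduces the assumed total of 330 exactly (up to rounding), whereas the reduced specification does not.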

So far, we have focused on the general framework for 
calibration with two levels of auxiliary information. This 
framework does not reveal the many interesting forms that 
the estimator $\hat{Y}$ given by (3.13) may take for specific cases
of auxiliary information. Some illustrations are given in 
Section 7. We first address three issues that are of practical 
interest in virtually every major survey: (i) poststratifica- 
tion or, more generally, the presence of auxiliary informa- 
tion for population subgroups (Section 5), (ii) estimation for 
domains of interest (Section 6), and (iii) the construction of 
variance estimates (Section 6).


4. THE TWO-PHASE CALIBRATION
ESTIMATOR VIEWED AS A REGRESSION 
ESTIMATOR 


An alternative expression for the calibration estimator 
(3.13) is given by formula (4.1) below. This expression 
links it exactly with the regression estimator for two-phase 
designs introduced in Särndal et al. (1992, chapter 9).


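Theorem 4.1: When the overall calibrated weights $\tilde{w}_k^*$ are
determined by (3.9), the calibration estimator (3.13) is
identical to the two-phase regression estimator given by

$$\hat{Y} = \sum_U \hat{y}_{1k} + \sum_{s_1} w_{1k} (\hat{y}_{2k} - \hat{y}_{1k}) + \sum_{s_2} w_k^* (y_k - \hat{y}_{2k}) \tag{4.1}$$

where $\hat{y}_{1k}$ and $\hat{y}_{2k}$ are successive regression predictions
such that

$$\hat{y}_{1k} = \mathbf{x}'_{1k} \hat{\mathbf{B}}_1 \tag{4.2}$$

with

$$\hat{\mathbf{B}}_1 = \mathbf{T}_1^{-1} \Big[ \sum_{s_1} \frac{w_{1k} \mathbf{x}_{1k} \hat{y}_{2k}}{C_{1k}} + \sum_{s_2} \frac{w_k^* \mathbf{x}_{1k} (y_k - \hat{y}_{2k})}{C_{1k}} \Big] \tag{4.3}$$

where $\mathbf{T}_1$ is given by (3.6), and

$$\hat{y}_{2k} = \mathbf{x}'_k \hat{\mathbf{B}}_2 \tag{4.4}$$

with

$$\hat{\mathbf{B}}_2 = \mathbf{T}_2^{-1} \sum_{s_2} \frac{\tilde{w}_{1k} w_{2k} \mathbf{x}_k y_k}{C_{2k}} \tag{4.5}$$

where $\mathbf{T}_2$ is given by (3.12).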

The proof for Theorem 4.1 uses some tedious but 
straightforward algebra and is not presented here. 
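
Although the proof is omitted, the identity can be checked numerically. The following sketch computes the calibration estimator (3.13) and the regression estimator (4.1) on a small made-up data set and compares them. All helper names, the data, and the choice $C_{1k} = C_{2k} = 1$ are assumptions for illustration; only the population total of $\mathbf{x}_{1k}$ is needed, since $\sum_U \hat{y}_{1k} = (\sum_U \mathbf{x}_{1k})' \hat{\mathbf{B}}_1$.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def vec_sum(coef, X):
    """Return sum_k coef_k * x_k for the row vectors x_k in X."""
    return [sum(c * x[a] for c, x in zip(coef, X)) for a in range(len(X[0]))]


def t_matrix(w, X, C):
    """T = sum_k w_k x_k x_k' / C_k, the matrix in (3.6) and (3.12)."""
    p = len(X[0])
    return [[sum(wk * x[a] * x[b] / c for wk, x, c in zip(w, X, C))
             for b in range(p)] for a in range(p)]


def g_factors(w, X, C, target):
    """GLS factors g_k = 1 + (target - sum_k w_k x_k)' T^{-1} x_k / C_k."""
    gap = [t - e for t, e in zip(target, vec_sum(w, X))]
    lam = solve(t_matrix(w, X, C), gap)          # T is symmetric
    return [1.0 + dot(lam, x) / c for x, c in zip(X, C)]


# Toy first-phase sample of six units; units 0, 1, 3, 4 form s2 (made-up data).
w1 = [10.0, 10.0, 20.0, 20.0, 15.0, 15.0]
X1 = [[1.0, a] for a in (2, 4, 3, 5, 1, 6)]                # x_1k
Xf = [x + [b] for x, b in zip(X1, (1, 3, 2, 2, 4, 5))]     # x_k = (x_1k', x_2k)'
C1 = [1.0] * 6
tot_x1 = [100.0, 330.0]                                    # assumed sum_U x_1k
s2 = [0, 1, 3, 4]
w2 = [2.0, 2.0, 1.5, 1.5]
y = [5.0, 9.0, 11.0, 6.0]
C2 = [1.0] * 4

# Two successive calibrations, (3.4)-(3.6) and (3.9)-(3.12).
g1 = g_factors(w1, X1, C1, tot_x1)
w1_cal = [w * g for w, g in zip(w1, g1)]
start = [w1_cal[k] * v for k, v in zip(s2, w2)]
X2 = [Xf[k] for k in s2]
g2 = g_factors(start, X2, C2, vec_sum(w1_cal, Xf))
w_star = [w1[k] * v for k, v in zip(s2, w2)]
y_cal = sum(ws * g1[k] * g * yk
            for ws, k, g, yk in zip(w_star, s2, g2, y))    # (3.13)

# Two successive regression fits, (4.2)-(4.5).
B2 = solve(t_matrix(start, X2, C2),
           vec_sum([s * yk / c for s, yk, c in zip(start, y, C2)], X2))
yhat2 = [dot(x, B2) for x in Xf]                           # all k in s1
resid2 = [yk - yhat2[k] for k, yk in zip(s2, y)]           # k in s2
rhs = [p + q for p, q in zip(
    vec_sum([w * h / c for w, h, c in zip(w1, yhat2, C1)], X1),
    vec_sum([ws * r / C1[k] for ws, r, k in zip(w_star, resid2, s2)],
            [X1[k] for k in s2]))]
B1 = solve(t_matrix(w1, X1, C1), rhs)                      # (4.3)
yhat1 = [dot(x, B1) for x in X1]
y_reg = (dot(tot_x1, B1)                                   # sum_U yhat_1k
         + sum(w * (h2 - h1) for w, h2, h1 in zip(w1, yhat2, yhat1))
         + dot(w_star, resid2))                            # (4.1)
```

On these data `y_cal` and `y_reg` agree up to floating-point rounding, as Theorem 4.1 asserts.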

We now show that (4.1) can be constructed via
regression estimation in two steps. For the first step,
suppose that the variable of interest $y_k$ were observed for
the full first-phase sample $s_1$. The auxiliary information on $\mathbf{x}_{1k}$
is available for $k \in s_1$ and the population total $\sum_U \mathbf{x}_{1k}$ is
known. The resulting regression estimator of $Y = \sum_U y_k$
would then be given by

$$\sum_U \hat{y}_{1k}^0 + \sum_{s_1} w_{1k} (y_k - \hat{y}_{1k}^0) = \sum_{s_1} w_{1k} y_k + \sum_U \hat{y}_{1k}^0 - \sum_{s_1} w_{1k} \hat{y}_{1k}^0. \tag{4.6}$$

In the last expression, the first term represents the
(hypothetical) first-phase Horvitz-Thompson estimator of
$Y$. The second and third terms represent a regression
adjustment, where $\hat{y}_{1k}^0$ is the predictor of $y_k$ based on the
fitted regression of $y_k$ on $\mathbf{x}_{1k}$ for $k \in s_1$. That is,
$\hat{y}_{1k}^0 = \mathbf{x}'_{1k} \hat{\mathbf{B}}_1^0$ with

$$\hat{\mathbf{B}}_1^0 = \mathbf{T}_1^{-1} \sum_{s_1} \frac{w_{1k} \mathbf{x}_{1k} y_k}{C_{1k}}.$$

Note that $\sum_U \hat{y}_{1k}^0 = (\sum_U \mathbf{x}_{1k})' \hat{\mathbf{B}}_1^0$, where $\sum_U \mathbf{x}_{1k}$ is known.
However, none of the terms in (4.6) can be computed
directly, because $y_k$ is only observed for the second-phase
sample. A second step of regression estimation is thus
necessary. It is carried out by replacing the unknown
$\sum_{s_1} w_{1k} y_k$ in (4.6) by its conditional regression estimator

$$\sum_{s_1} w_{1k} \hat{y}_{2k} + \sum_{s_2} w_k^* (y_k - \hat{y}_{2k}) \tag{4.7}$$

where $\hat{y}_{2k} = \mathbf{x}'_k \hat{\mathbf{B}}_2$, with $\hat{\mathbf{B}}_2$ given by (4.5), is the predictor
of $y_k$ based on the regression of $y_k$ on $\mathbf{x}_k$, known up to $s_1$.
Next, the vector $\hat{\mathbf{B}}_1^0$ required for computing $\hat{y}_{1k}^0$ contains a
known matrix $\mathbf{T}_1$ and an unknown vector

$$\sum_{s_1} \frac{w_{1k} \mathbf{x}_{1k} y_k}{C_{1k}}.$$

Using a regression estimator for this unknown vector, we
obtain $\hat{\mathbf{B}}_1$ given by (4.3) as a replacement for $\hat{\mathbf{B}}_1^0$. These
two substitutions in (4.6) lead to the two-phase regression
estimator given by (4.1), which is identical to the
calibration estimator (3.13).

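Remark 4.1: A more direct alternative to $\hat{\mathbf{B}}_1$ in (4.3)
would be to use only the second-phase sample. This would
have produced

$$\hat{\mathbf{B}}_{1,\mathrm{alt}} = \Big( \sum_{s_2} \frac{w_k^* \mathbf{x}_{1k} \mathbf{x}'_{1k}}{C_{2k}} \Big)^{-1} \sum_{s_2} \frac{w_k^* \mathbf{x}_{1k} y_k}{C_{2k}}.$$

The resulting predictions $\hat{y}_{1k,\mathrm{alt}} = \mathbf{x}'_{1k} \hat{\mathbf{B}}_{1,\mathrm{alt}}$ would be
replacing $\hat{y}_{1k}$ in (4.1). However, the resulting regression
estimator is not identical to (3.13) and is a less efficient
alternative, because $\hat{\mathbf{B}}_{1,\mathrm{alt}}$ uses less $\mathbf{x}_{1k}$-information than
$\hat{\mathbf{B}}_1$.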


5. CALIBRATION GROUPS 


In this Section we apply the results of Sections 3 and 4
to the important case where the auxiliary data in Table 1
include information about mutually exclusive and
exhaustive subsets of the population $U$, and of the
first-phase sample $s_1$. The population subsets are denoted by
$U_i$, $i = 1, \ldots, I$, and the first-phase subsets by $s_{1j}$,
$j = 1, \ldots, J$. Such subsets are called calibration groups, for
reasons that will become clear later in this Section. Simple
examples of calibration groups are poststrata.

Two vectors denoted $\Lambda_{1k}$ and $\Lambda_{2k}$ will be used to specify
the membership of a given unit $k$ in the calibration groups $U_i$
and $s_{1j}$, respectively. These group identifiers are

$$\Lambda_{1k} = (\delta_{11k}, \ldots, \delta_{1ik}, \ldots, \delta_{1Ik})' \tag{5.1}$$

with

$$\delta_{1ik} = \begin{cases} 1 & \text{if } k \in U_i \\ 0 & \text{otherwise} \end{cases} \quad \text{for } i = 1, \ldots, I \tag{5.2}$$

and

$$\Lambda_{2k} = (\delta_{21k}, \ldots, \delta_{2jk}, \ldots, \delta_{2Jk})' \tag{5.3}$$

with

$$\delta_{2jk} = \begin{cases} 1 & \text{if } k \in s_{1j} \\ 0 & \text{otherwise} \end{cases} \quad \text{for } j = 1, \ldots, J. \tag{5.4}$$


Besides the group membership information, which is
qualitative and specified by $\Lambda_{1k}$ and $\Lambda_{2k}$, there may exist
information for the unit $k$ about quantitative (continuous or
discrete) variables. We call them supplementary auxiliary
variables. For example, categorical information about a 
unit (enterprise) in a business survey may consist of an 
industry code or a geographical location code. In addition, 
quantitative variable information may also be available 
concerning the number of employees or the gross business 
income of the unit. Some of these supplementary auxiliary 
variables may be known up to the level of the population, 
and others up to the level of the first-phase sample. 

We assume in this Section that the vector $\mathbf{x}_{1k}$, used in
calculating the first-phase g-factors, has the structure

$$\mathbf{x}'_{1k} = \Lambda'_{1k} \otimes \mathbf{z}'_{1k} \tag{5.5}$$

where $\mathbf{z}_{1k}$, of dimension $Q_1$, is the vector of supplementary
auxiliary variables available for the first-phase sample. The
information requirements in Table 1 apply to the vector
$\mathbf{x}_{1k}$. This implies that we must know either the group
membership specified by $\Lambda_{1k}$ and the value of $\mathbf{z}_{1k}$ for every
$k \in U$, or the total $\sum_{U_i} \mathbf{z}_{1k}$ separately for each group,
$i = 1, \ldots, I$.

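When $\mathbf{x}_{1k}$ has the form given by (5.5), the first-phase
g-factors $g_{1k}$ in (3.5) can be obtained by a group by group
calculation. The $\mathbf{T}_1$ matrix to be inverted, given by (3.6), is
block diagonal and of dimension $IQ_1$ by $IQ_1$. The typical
diagonal block, denoted as $\mathbf{T}_{1i}$, of dimension $Q_1$ by $Q_1$, is
given by

$$\mathbf{T}_{1i} = \sum_{s_{1i}} \frac{w_{1k} \mathbf{z}_{1k} \mathbf{z}'_{1k}}{C_{1k}} \tag{5.6}$$

for $i = 1, \ldots, I$, where $s_{1i} = s_1 \cap U_i$. The resulting inverse of $\mathbf{T}_1$ is also block
diagonal with diagonal matrices $\mathbf{T}_{1i}^{-1}$. The off-diagonal
blocks of the inverse of $\mathbf{T}_1$ are zero matrices. So we obtain
from (3.5)

$$g_{1k} = 1 + \Big( \sum_{U_i} \mathbf{z}_{1k} - \sum_{s_{1i}} w_{1k} \mathbf{z}_{1k} \Big)' \mathbf{T}_{1i}^{-1} \frac{\mathbf{z}_{1k}}{C_{1k}} \tag{5.7}$$

for $k \in s_{1i}$, $i = 1, \ldots, I$, where $\mathbf{T}_{1i}$ is given by (5.6). Note that
the resulting weights $\tilde{w}_{1k}$ are the same as those obtained by
carrying out the first-phase calibration group by group,
calibrating for group $i$ on the known total $\sum_{U_i} \mathbf{z}_{1k}$. That is,
$\sum_{s_{1i}} \tilde{w}_{1k} \mathbf{z}_{1k} = \sum_{U_i} \mathbf{z}_{1k}$ for $i = 1, \ldots, I$. It is thus fitting to call
the groups $U_i$ first-phase calibration groups.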
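
The group by group calculation in (5.6) and (5.7) can be sketched as follows for a scalar supplementary variable ($Q_1 = 1$). The function name, group labels and data are illustrative assumptions. When $z_{1k} = 1$ and $C_{1k} = 1$, the factor reduces to the familiar poststratification ratio of the known group count to its estimate.

```python
from collections import defaultdict


def groupwise_first_phase_factors(w1, z1, group, z_totals, c1=None):
    """g_1k of (5.7) with scalar z_1k, computed one calibration group at a time.

    group[k] labels the first-phase calibration group U_i of unit k and
    z_totals[i] is the known total of z_1k over U_i.
    """
    if c1 is None:
        c1 = [1.0] * len(w1)
    z_hat = defaultdict(float)   # sum over s_1i of w_1k z_1k
    t = defaultdict(float)       # T_1i of (5.6), a scalar here
    for w, z, i, c in zip(w1, z1, group, c1):
        z_hat[i] += w * z
        t[i] += w * z * z / c
    return [1.0 + (z_totals[i] - z_hat[i]) * z / (t[i] * c)
            for z, i, c in zip(z1, group, c1)]


# Toy first-phase sample with two calibration groups (made-up values).
w1 = [10.0, 10.0, 20.0, 20.0, 15.0, 15.0]
z1 = [2.0, 4.0, 3.0, 5.0, 1.0, 6.0]
group = ["A", "A", "A", "B", "B", "B"]

g = groupwise_first_phase_factors(w1, z1, group, {"A": 130.0, "B": 210.0})
cal_totals = defaultdict(float)
for w, gk, z, i in zip(w1, g, z1, group):
    cal_totals[i] += w * gk * z          # should reproduce 130 and 210

# Pure poststratification: z_1k = 1 and known counts N_A = 50, N_B = 45.
g_ps = groupwise_first_phase_factors(w1, [1.0] * 6, group,
                                     {"A": 50.0, "B": 45.0})
```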

Now consider the second-phase g-factors $g_{2k}$ given by
(3.11). They are based on the auxiliary vectors $\mathbf{x}_k$,
required to be known for the units $k \in s_1$. We assume that
$\mathbf{x}_k$ contains information about the second-phase groups so
that

$$\mathbf{x}'_k = \Lambda'_{2k} \otimes \mathbf{z}'_k \tag{5.8}$$

where $\Lambda_{2k}$ is the second-phase group identifier, and $\mathbf{z}_k$ is
the value of a vector of supplementary auxiliary variables
available for $k \in s_1$. Since the requirements in Table 1
apply, it follows that $\Lambda_{2k}$ (the second-phase group
membership) and the value of $\mathbf{z}_k$ (the supplementary
auxiliary vector) must be known for every $k \in s_1$. Here $\mathbf{z}_k$
may contain some or all of the information in $\mathbf{x}_{1k}$ given by
(5.5), and any other information available for the units
$k \in s_1$.

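When $\mathbf{x}_k$ has the structure (5.8), the factors $g_{2k}$ can also
be obtained through a group by group calculation. This
simplification is a result of the fact that the matrix to be
inverted in (3.11) is block diagonal. We obtain

$$g_{2k} = 1 + \Big( \sum_{s_{1j}} \tilde{w}_{1k} \mathbf{z}_k - \sum_{s_{2j}} \tilde{w}_{1k} w_{2k} \mathbf{z}_k \Big)' \mathbf{T}_{2j}^{-1} \frac{\mathbf{z}_k}{C_{2k}} \tag{5.9}$$

for $k \in s_{2j} = s_2 \cap s_{1j}$; $j = 1, \ldots, J$, where

$$\mathbf{T}_{2j} = \sum_{s_{2j}} \frac{\tilde{w}_{1k} w_{2k} \mathbf{z}_k \mathbf{z}'_k}{C_{2k}}. \tag{5.10}$$

The resulting overall weights $\tilde{w}_k^* = w_k^* g_k^*$, where $g_k^* =
g_{1k} g_{2k}$, are the same as those obtained by carrying out the
second-phase calibration group by group, calibrating for
group $j$ on the known quantity $\sum_{s_{1j}} \tilde{w}_{1k} \mathbf{z}_k$. That is,
$\sum_{s_{2j}} \tilde{w}_k^* \mathbf{z}_k = \sum_{s_{1j}} \tilde{w}_{1k} \mathbf{z}_k$ for $j = 1, \ldots, J$. The groups $s_{1j}$ are
called second-phase calibration groups. We now have a
procedure for computing $g_{1k}$ and $g_{2k}$ group by group using
(5.7) and (5.9). The total $Y$ is still estimated according to
(3.13).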


6. DOMAIN ESTIMATION AND VARIANCE 
ESTIMATION 


The preceding sections dealt with estimation of the total 
of y at the entire population level. In most surveys, there is 
also a need to provide estimates for various subpopulations 
or domains of interest. Requests for domain estimates can 
be made either before or after the sampling stage of the 
survey. Auxiliary information is essential for domains. A
precise domain estimate may be obtained (even for small
domains) if: (i) calibration groups and domains of interest
agree closely, and (ii) the auxiliary variables exhibit a strong
regression relationship with the variable(s) of interest.

Denote by $U_d$ ($U_d \subseteq U$) any domain of the population
$U$ for which an estimate is required. The y-total for the
domain $U_d$ is defined by $Y(d) = \sum_{U_d} y_k = \sum_U y_k(d)$ with
$y_k(d) = y_k$ if $k \in U_d$ and $y_k(d) = 0$ if $k \notin U_d$.

The estimator of $Y(d)$ is

$$\hat{Y}(d) = \sum_{s_2} \tilde{w}_k^* y_k(d) \tag{6.1}$$

where the overall calibrated weights $\tilde{w}_k^* = w_k^* g_k^*$ may be
calculated group by group as described in Section 5. The
calibration factors $g_{1k}$ and $g_{2k}$ are calculated using all
relevant available auxiliary information, specified as in
Table 1. So in this sense, the resulting overall calibrated
weights $\tilde{w}_k^*$ are the best possible ones. Note that these
weights are independent of the particular domains requiring 
estimation in the survey. 
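
Because the calibrated weights do not depend on the domain, domain estimation amounts to replacing $y_k$ by $y_k(d)$ in (3.13). A minimal sketch, in which the function name `domain_estimate`, the weights, and the domain labels are illustrative assumptions:

```python
def domain_estimate(cal_weights, y, in_domain):
    """Estimate Y(d) as in (6.1): sum over s2 of w~*_k y_k(d),
    where y_k(d) = y_k inside the domain and 0 outside."""
    return sum(w * (yk if inside else 0.0)
               for w, yk, inside in zip(cal_weights, y, in_domain))


# Overall calibrated weights for a second-phase sample (made-up values).
cal_weights = [20.5, 22.0, 31.5, 26.0]
y = [5.0, 9.0, 11.0, 6.0]
region = ["east", "west", "east", "west"]

east = domain_estimate(cal_weights, y, [r == "east" for r in region])
west = domain_estimate(cal_weights, y, [r == "west" for r in region])
overall = domain_estimate(cal_weights, y, [True] * len(y))
```

Since one set of weights serves every domain, estimates for domains that partition the population add up to the estimate of the overall total.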

The estimator of the variance for the domain total
estimator $\hat{Y}(d)$ is obtained using a design-based approach.
This means that the variance is interpreted with reference to
repeated draws of samples $s_1$ and $s_2$. Details for the
derivation of this variance are given in Särndal et al. (1992)
(Result 9.7.1, p. 362). The first-order and second-order
inclusion probabilities enter into the weights used in the
variance formula. The weights associated with the first-phase
sample are $w_{1k} = 1/\pi_{1k}$ and $w_{1kl} = 1/\pi_{1kl}$, with
$\pi_{1kl} = P(k \text{ and } l \in s_1)$. The weights $w_{2k} = 1/\pi_{2k}$ and
$w_{2kl} = 1/\pi_{2kl}$ with $\pi_{2kl} = P(k \text{ and } l \in s_2 \mid s_1)$ denote their
second-phase counterparts. Two sets of regression residuals,
one for each phase, are also required. The estimator of the
variance of $\hat{Y}(d)$ is given by

$$v\{\hat{Y}(d)\} = \sum_{k \in s_2} \sum_{l \in s_2} w_{2kl} (w_{1k} w_{1l} - w_{1kl}) (g_{1k} e_{1k}(d)) (g_{1l} e_{1l}(d)) + \sum_{k \in s_2} \sum_{l \in s_2} w_{1k} w_{1l} (w_{2k} w_{2l} - w_{2kl}) (g_k^* e_{2k}(d)) (g_l^* e_{2l}(d)). \tag{6.2}$$

Note that for $k = l$ we have $w_{1kl} = w_{1k}$ and $w_{2kl} = w_{2k}$ in
(6.2). We now specify the regression residuals in (6.2)
assuming that there are first-phase calibration groups
$U_i$, $i = 1, \ldots, I$, and second-phase calibration groups
$s_{1j}$, $j = 1, \ldots, J$, as explained in Section 5. We denote the
associated sample subsets as follows: $s_{1i} = s_1 \cap U_i$; $s_{2i} = s_2 \cap U_i$;
$s_{2j} = s_2 \cap s_{1j}$. The required residuals in (6.2) are, for
$k \in s_2 \cap U_i$,

$$e_{1k}(d) = y_k(d) - \mathbf{z}'_{1k} \hat{\mathbf{B}}_{1i}(d) \tag{6.3}$$

and, for $k \in s_2 \cap s_{1j}$,

$$e_{2k}(d) = y_k(d) - \mathbf{z}'_k \hat{\mathbf{B}}_{2j}(d). \tag{6.4}$$

The estimated regression vectors $\hat{\mathbf{B}}_{1i}(d)$ and $\hat{\mathbf{B}}_{2j}(d)$ are

$$\hat{\mathbf{B}}_{1i}(d) = \mathbf{T}_{1i}^{-1} \Big[ \sum_{s_{1i}} \frac{w_{1k} \mathbf{z}_{1k} \hat{y}_{2k}(d)}{C_{1k}} + \sum_{s_{2i}} \frac{w_k^* \mathbf{z}_{1k} (y_k(d) - \hat{y}_{2k}(d))}{C_{1k}} \Big] \tag{6.5}$$

where $\mathbf{T}_{1i}$ is given by (5.6), and

$$\hat{\mathbf{B}}_{2j}(d) = \mathbf{T}_{2j}^{-1} \sum_{s_{2j}} \frac{\tilde{w}_{1k} w_{2k} \mathbf{z}_k y_k(d)}{C_{2k}} \tag{6.6}$$

with $\mathbf{T}_{2j}$ given by (5.10), and
$\hat{y}_{2k}(d) = \mathbf{z}'_k \hat{\mathbf{B}}_{2j}(d)$ for $k \in s_{1j}$.


Remark 6.1: Note that for each new domain of interest, the
variance estimator (6.2) requires two new sets of
domain-dependent residuals, $e_{1k}(d)$ and $e_{2k}(d)$. Moreover, these
are required for all of the units $k$ in the second-phase
sample $s_2$, including units outside the domain. Variance
estimation for domains can therefore be cumbersome.


Remark 6.2: In practice the computation of estimated
variances is seldom carried out as a double sum. For some
important designs, the double sums reduce, after some
algebraic manipulation, to single-sum expressions.
Examples of this occur for simple random sampling and for
stratified simple random sampling in both phases. Explicit algebraic
developments for the variances have been given in the former
case by Särndal et al. (1992), and in the latter case by
Hidiroglou (1995), and Binder, Babyak, Brodeur,
Hidiroglou and Jocelyn (1997).
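
The reduction of a double sum to a single sum is easiest to see when units are selected independently in both phases (Poisson or Bernoulli sampling), a design assumed here purely for illustration. Then $\pi_{1kl} = \pi_{1k}\pi_{1l}$ and $\pi_{2kl} = \pi_{2k}\pi_{2l}$ for $k \neq l$, so every off-diagonal term of a variance estimator with the structure of (6.2) vanishes. In the sketch below the function names and inputs are hypothetical; `a[k]` stands for $g_{1k} e_{1k}(d)$ and `b[k]` for $g_k^* e_{2k}(d)$, which are supplied directly rather than computed.

```python
def variance_double_sum(w1, w2, w1_joint, w2_joint, a, b):
    """Literal double sum with the structure of (6.2) over s2.

    w1_joint(k, l) and w2_joint(k, l) return the joint weights
    w_1kl and w_2kl; a[k] = g_1k e_1k(d) and b[k] = g*_k e_2k(d).
    """
    n = len(w1)
    v = 0.0
    for k in range(n):
        for l in range(n):
            v += (w2_joint(k, l) * (w1[k] * w1[l] - w1_joint(k, l))
                  * a[k] * a[l])
            v += (w1[k] * w1[l] * (w2[k] * w2[l] - w2_joint(k, l))
                  * b[k] * b[l])
    return v


def variance_independent(w1, w2, a, b):
    """Single-sum form valid when selections are independent in both phases."""
    return sum(w2k * w1k * (w1k - 1.0) * ak ** 2
               + w1k ** 2 * w2k * (w2k - 1.0) * bk ** 2
               for w1k, w2k, ak, bk in zip(w1, w2, a, b))


# Made-up weights and weighted residuals for three second-phase units.
w1 = [4.0, 4.0, 2.0]
w2 = [2.0, 3.0, 2.0]
a = [1.5, -2.0, 0.5]
b = [1.0, -1.0, 2.0]

# Under independent selection, w_kl = w_k for k = l and w_k * w_l otherwise.
v_double = variance_double_sum(
    w1, w2,
    lambda k, l: w1[k] if k == l else w1[k] * w1[l],
    lambda k, l: w2[k] if k == l else w2[k] * w2[l],
    a, b)
v_single = variance_independent(w1, w2, a, b)
```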


7. APPLICATIONS WITH 
POSTSTRATIFICATION AT THE FIRST PHASE 


7.1 The Case of the Tax Sample at 
Statistics Canada 


An application of the calibration group approach in
Section 5 has been in use at Statistics Canada, in the
two-phase design for sampling of tax records. The example is
important because it provides the extension to two-phase
designs of the traditional poststratification technique as used
in a single-phase design. The sampling procedure, the
poststratification criteria, and the estimators are described in
Armstrong and St-Jean (1994). We now show how these
estimators are obtained as a special case of the technique in
Section 5. The sampling design, in each phase, is stratified
Bernoulli, carried out with the permanent random number
technique. The two stratifications are based on different
criteria. The realized sample sizes are random at each phase
on account of the Bernoulli sampling. To offset the
resulting tendency toward an increased variance,
poststratification is carried out at both phases of sampling. The two
poststratification criteria are different. We have in effect
two crossing poststratifications. In the terminology of
Section 5, the first-phase poststrata are the first-phase
calibration groups. They are denoted as $U_i$; $i = 1, \ldots, I$, and
the group membership of a unit $k$ is indicated by the vector
$\Lambda_{1k}$ given by (5.1). The second-phase poststrata are the
second-phase calibration groups. They are denoted as
$s_{1j}$; $j = 1, \ldots, J$, and the corresponding membership of a unit
$k$ is indicated by the vector $\Lambda_{2k}$ given by (5.3).

The first-phase calibration is carried out using the 
information about the first-phase poststrata sizes, N.. In this 
survey design, there is no supplementary information, so 
Z,,=1 for all & in (5.5), yielding x,, =A,,. Specifying 
C,, = 1 for all k we obtain from (5.7) that 

$$g_{1k} = N_{1i}/\hat{N}_{1i} \qquad (7.1)$$

for all $k \in s_{1i}$, where $\hat{N}_{1i} = \sum_{s_{1i}} w_{1k}$ estimates the known
first-phase poststratum count $N_{1i}$, and $s_{1i} = s_1 \cap U_{1i}$ denotes
the part of the first-phase sample $s_1$ that falls in the
first-phase poststratum $U_{1i}$.
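As a rough illustration of (7.1), the sketch below computes poststratified first-phase g-factors as the ratio of a known poststratum count to its weighted sample estimate. The function name, the toy sample and the known counts are all hypothetical; none of them come from the tax sample.

```python
from collections import defaultdict


def first_phase_g_factors(sample, known_counts):
    """Return one g-factor per first-phase poststratum.

    sample: iterable of (poststratum_label, first_phase_weight) pairs.
    known_counts: dict mapping poststratum_label to its known size.
    """
    estimated = defaultdict(float)
    for label, weight in sample:
        estimated[label] += weight  # weighted estimate of the poststratum count
    return {label: known_counts[label] / estimated[label] for label in estimated}


# Hypothetical first-phase sample: (poststratum, design weight).
toy_sample = [("A", 10.0), ("A", 10.0), ("B", 5.0), ("B", 5.0), ("B", 5.0)]
toy_counts = {"A": 25.0, "B": 12.0}

g = first_phase_g_factors(toy_sample, toy_counts)

# Weighted counts after applying the g-factors, per poststratum.
calibrated_total = {
    label: sum(w * g[lab] for lab, w in toy_sample if lab == label)
    for label in toy_counts
}
```

In this toy case the calibrated weights add up to the known counts within each poststratum, which is the defining property of poststratification.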

We arrive at the estimator of Armstrong and St-Jean 
(1994) by carrying out the second-phase calibration with 
x, =A,,, that is, we have z, = 1 for all kin (5.8). This is 
areduced x, -vector specification since it does not involve x, 
Specifying C,,=1 for all kes,,, and using (5.9) and 
(3.10), we obtain the overall calibrated weights 


aN Ne 
SE Sas ae (7:2) 
N,; 2j 
for all kes,, i. where 
me agp ip he ! 
NM =o} | M3 Ny = (7.3) 
i=] N,; i=l *P 
with MW, ,, =), Ww, and Ny, = . Heresy ise 


denotes ‘the patt of the second- bhibe aaiple 35 ae: falls 1 in 
ye second-phase poststratum s,,, and s,,,=U,Ms,); 

= 5, 1U,!s,,. It follows that the estimator of the total 
Y Fin? for a given domain U, is given by Y(d) = 
pare We &, ¥,(a), or equivalently as 


LIES SNe WN ‘ 
Vidic) pine a Dis, We Ye). 


TPAD; 


The estimated variance requires two types of residuals 
that are easily obtained from the general expressions given 
in Section 6. 

Alternatives exist to the reduced vector specification 
x, = A,, used for this design. We therefore examine what 
the estimator would look like under a full vector 
specification. For the first-phase calibration, as earlier, let 
X,, = Aj, Corresponding to z,, = 1 for all k in (5.8). The 
first-phase g-factors g,, are then given by (7.1). In this 


survey, information is available for assigning every unit 
kes, to one of the /x J cells formed by cross-classifying 
the two poststratification criteria. Therefore, the vector x, 
for the second-phase calibration can be taken as 


x, =A), eA, (7.4) 


This is a full vector specification in that it includes the 
first-phase information carrier A,,. Let us also specify 
C,,=1 for all k. Since (7.4) is of the form (5.8), the 
second-phase g-factors g,, are obtainable group-by-group 
from (3.9) with z= A The overall calibration factors 
are given by 


Vay 


— (7.5) 


for all kes,,,. Here, N,, is defined in (7.1), and Ne and 
Ny, are as fe (7. ay These overall calibration factors ae the 
product of two poststratified calibration factors. They are 
all positive and well defined, provided all sample cells s,,, 
are non-empty. Collapsing of small cells s,,. with relatively 
large non-empty cells is recommended for stable estimation. 
As pointed out in Remark 3.4, the overall weights obtained 
from (7.5) reproduce the known first-phase postrata sizes 
N,, whereas those obtained from (7.2) do not. 


Remark 7.1: Let us compare the calibration factors (7.2) 
and (7.5), resulting, respectively, from the reduced form 
x, =A,, and from the full form (7.4). Both factors are a 
product of two terms. The only difference lies in the second 
term. In both cases, the computation of the second term 
requires cross-classification information. That is, forevery k€ Si 
we need to identify the cross-classification cell ij to which 
k belongs. In the case of the reduced vector, the cell 
information is pooled across the first-phase groups. For the 
full vector, the cell information is kept separate, and one 
would expect the resulting weights to be more efficient. 


Remark 7.2: For the second-phase calibration, an 
alternative to (7.4) that also captures the information about 
the first-phase poststrata is to use 

xy = (Aig Ay ge (7.6) 
Note that with this specification, there is only one 


calibration group in the second phase, namely the whole 
first-phase sample s,. 


7.2 The Case of the Canadian Survey of Employment,
Payrolls and Hours


The Survey of Employment, Payrolls and Hours (SEPH)
covers all sectors of Canadian industry, and collects data
on four principal variables: (i) salaries and payments to
employees (denoted as $z_2$; called payrolls); (ii) number of
employees ($z_3$; employment); (iii) hours worked by
employees ($y_1$; hours); and (iv) summarized earnings
($y_2$; earnings).

SEPH (1994) uses a stratified two-phase sampling
design. In the first phase, a sample of payroll deduction 
accounts is selected using a stratified Bernoulli sampling 
design with sampling rates within strata ranging from 10% 
to 100%. The strata are defined by region. A region is 
made up of one or more Canadian provinces. We describe 
the estimation for SEPH by considering one specific region. 

For units selected in the first-phase sample, two variables
are transcribed, namely, payrolls ($z_2$) and number of
employees ($z_3$). In the second phase, a simple random
sample is drawn. Data on the two variables of interest, $y_1$
and $y_2$, are collected for respondents in this sample. In
addition, classification by industry and province is recorded
for sampled units. The first-phase sample is poststratified
by employment size groups. These are used as first-phase
calibration groups and denoted $U_{1i}$; $i = 1, \ldots, I$. Their sizes,
denoted as $N_{1i}$ for $i = 1, \ldots, I$, are assumed known. The
vector x,, used for a first-phase calibration is of the form 
(5.5), where A,, is given by (5.1) and z,, = 1 for all k. We 
choose C,, = 1 for all k. It follows from (5.7) that the first- 
phase g-factors are 

$$g_{1k} = N_{1i}/\hat{N}_{1i} \qquad (7.7)$$

for all $k \in s_{1i} = s_1 \cap U_{1i}$, where $\hat{N}_{1i} = \sum_{s_{1i}} w_{1k}$.

We now ‘turn to second- -phase calibration. It is Carried out 
using calibration groups Sis j=1,..., J, identified by the 
vector A,, given by (5.3). These groups are based on a 
province by industry classification. They are constructed so 
that: (i) there is a strong regression relationship between y, 
and the two z-variables, and that (11) there are at least 30 
observations within each group. The J (J + 2) dimensional 
x,-vector for the second-phase calibration is given by 

Xp = Ay @ Ales 269 234) (7.8) 

This specification requires (see Table 1) that every kes, 
can be classified into one of the J by J cells formed by 
See the calibration groups in the two phases. Let 

s,Ms,;3 8 P= SO 89, 8 Ms,,,- Also, the quan- 
Pave variable vanes 25 (payrolls) aad z,, (number of 
employees) must be known forekes;. The x ,-vector 
specification given by (7.8) is full, because it incorporates 
x,, =A,,- A reduced vector, ignoring the first-phase 
groups, would be x, = A>, ® (Z,,,23,)- 

As in Example 7.1, we have two crossing sets of 
calibration groups. 

Since the x,-vector (7.8) has the structure defined by 
(5.8), we used (5.9) to derive the second-phase g-factors for 
each group j =1,..., J. It follows from (7.8) that we are 
fitting, within each second-phase calibration group, a 
separate regression of y, on ¢,=(Z,,,23,)’ with an 
intercept that varies with the first-phase calibration group. 

Specifying C,, = 1 for all k, and using the additive form, 
ge = 21, + &,- 1, for the overall calibration factors, we 
obtain after some algebra

for all may where


mete follows that we can write the estimator (6.1) as 
Y(@)= a phe 7 ¥,,@) with 


¥ (d) J GN AI, @ + co ery’ B. j(4)} 


Soi 


where 


(2) = ve id: “y(@)IN. 2ij 


and B. ja) =F; oat anes mad ee Gs pr): 

The form of ¥(d) is easy to understand. It is composed 
of J x J cell estimates ie (d), each reflecting the regression 
of y,(d@) on ¢,. Note that the two-dimensional slope vector 
B j (d) is obtained by pooling data across the first-phase 
groups. This is because the specification (7.8) of x, allows 
the intercept, but not the two regression slopes, to vary with 
the first-phase groups. 


8. CONCLUSIONS 


Two-phase designs have the advantage of being both 
economical and efficient. The present paper has provided 
a general theory for such designs when auxiliary 
information is present in each phase. 

Our goal is to incorporate this two-phase survey method- 
ology into Statistics Canada’s Generalized Estimation 
System (GES) described in Estevao et al. (1995). The GES 
is a general purpose program that currently handles domain 
estimation for arbitrary single phase designs and incor- 
porates auxiliary information in its estimation process. In 
this paper we have extended the basic principles of the 
GES, including the important idea of calibration groups, to 
two-phase designs.

We have illustrated the theory by showing its use in two
current surveys at Statistics Canada. Given its generality, 
the theory has potential application to any two-phase 
sample design that uses auxiliary information.


Estimation in Sample Surveys Using Frames With a
Many-to-Many Structure 


TERRI L. BYCZKOWSKI, MARTIN S. LEVY and DENNIS J. SWEENEY' 


ABSTRACT 


In sample surveys, the units contained in the sampling frame ideally have a one-to-one correspondence with the elements
in the target population under study. In many cases, however, the frame has a many-to-many structure. That is, a unit in
the frame may be associated with multiple target population elements and a target population element may be associated
with multiple frame units. Such was the case in a building characteristics survey in which the frame was a list of street
addresses, but the target population was commercial buildings. The frame was messy because a street address corresponded
either to a single building, multiple buildings, or part of a building. In this paper, we develop estimators and formulas for
their variances in both simple and stratified random sampling designs when the frame has a many-to-many structure.


KEY WORDS: Imperfect frames; Correspondence errors; Building characteristics survey; Weighting; Simple random
sampling; Stratified random sampling.


1. INTRODUCTION 


This research was motivated by a study that was 
conducted for a utility company to estimate various popu- 
lation characteristics of the commercial buildings located in 
their service area. Budgetary constraints prohibited the 
development of a list of commercial buildings using 
canvassing techniques. However, a sampling frame consis- 
ting of street addresses (i.e., addresses at which a utility 
meter was located) was available. A drawback of this 
frame was that it had a many-to-many relationship with the 
target population of commercial buildings. That is, some 
units in the frame were associated with multiple target 
population elements, and some target population elements 
were associated with multiple frame units. In fact, several 
of the relationships between street addresses and com- 
mercial buildings were relatively complex. 

An advantage of this frame, however, was that total 
annual electrical usage was available for each street 
address. This resulted in a variable upon which the frame 
of street addresses could be effectively stratified. One of 
the important characteristics to be measured was the total 
commercial square footage. Studies conducted in the 
United States have shown that energy consumption is 
associated with both building size and building activity. 
For example, consumption is higher for buildings used for 
health care or food sales, and lower for buildings used for 
religious worship or public assembly. Also, energy 
consumption is correlated with building size even if the 
activity of the building is not known, as was the case here 
(U.S. Department of Energy 1992). 

There is a vast amount of literature dealing with 
imperfect sampling frames. Comprehensive summaries of 
this literature can be found in Kish (1965), Wright and Tsao
(1983), and Lessler and Kalsbeek (1992). Another body of
literature addresses multiplicity sampling in which the 
frame is constructed with a many-to-many structure by 
design. Here, frame imperfections are introduced in order 
to gather information more efficiently on rare occurrences 
in a population (Birnbaum and Sirken 1965, Sirken 
1972a,b, and Casady and Sirken 1980). Hansen, Hurwitz 
and Madow (1953a,b) present an estimator for use with 
sampling frames that have a many-to-one structure; 
population elements are represented multiple times in the 
frame. This estimator has also been adopted for use by 
National Agricultural Statistics Service (NASS) surveys 
(Musser 1993) with respect to the many-to-one frame. 
Bandyopadhyay and Adhikari (1993) developed estimators 
for a ratio, population mean, and population total when an 
unknown amount of duplication is present in the frame. 
But, these estimators are restricted to the simple random 
sampling case and the many-to-one frame. 

Two methods for estimating population characteristics 
using a frame with a many-to-many structure appear in the 
literature. First, the Horvitz-Thompson estimator (1952) 
provides unbiased estimates of population means and totals 
when varying probabilities of selection are present. Musser 
(1993) shows how to compute the correct inclusion 
probabilities for the population elements selected in simple 
random sampling from a many-to-one frame. However, 
Musser’s method can be extended to obtain inclusion 
probabilities for population elements in a simple random 
sample from the many-to-many frame as well. Second, 
Lavallée (1995) adapted the Weight Share Method, applied 
to longitudinal surveys, to the use of frames with a 
many-to-many structure. 

The purpose of this paper is to develop an alternative
methodology for estimating population totals, counts, and
means when using sampling frames with a many-to-many
structure under simple and stratified random sampling
designs. Also, expressions for the variance of those
estimators are derived. The results which we develop are
not only of intrinsic interest, but expressions for the
variance of the estimators are essential for the exploration
of the effects of correspondence imperfections inherent in
many-to-many sampling frames on the precision of these
estimates.

In section 2 we present these estimates in the simple 
random sampling without replacement (SRSWOR) case. 
We also describe the sampling methodology under which 
these estimators are applicable, state a result on bias, and 
develop expressions for their variance. 

In section 3 some of the results are extended to the case 
of stratified random sampling. In section 4 we develop 
conclusions, discuss limitations and make suggestions for 
future research. 


2. MANY-TO-MANY FRAMES FOR SIMPLE 
RANDOM SAMPLING 


It is useful to think of the relationship between the frame 
and the target population as a graph. The sampling units in 
the frame and the elements of the target population are the 
two sets of nodes; arcs link the sampling units to elements 
of the target population. These arcs reveal the structure of 
the relationship between the frame and the target popu- 
lation. Figure 2.1 shows an example of a frame and target 
population with a many-to-many relationship. There are 
7 sampling units in the frame, 6 elements in the target 
population and 10 links (arcs) between the sampling units 
and the elements of the population. Thus, a graph with 
13 nodes and 10 arcs represents this many-to-many 
structure. In this paper we assume that each population 
element is linked to the set of frame units by at least one arc 
and that each frame unit is linked to the set of population 
elements by at least one arc as well. 
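The graph described above can be held as a simple adjacency mapping. The arc list below is an assumption: it is inferred from the way Figure 2.1 is described in this section (frame unit 1 linked to element 1, units 2 and 3 to element 2, unit 4 to elements 3 and 4, unit 5 to element 5, and units 6 and 7 to elements 5 and 6). The names `ARCS` and `check_frame` are illustrative only.

```python
# Hypothetical encoding of Figure 2.1: frame unit -> linked population elements.
ARCS = {1: [1], 2: [2], 3: [2], 4: [3, 4], 5: [5], 6: [5, 6], 7: [5, 6]}


def check_frame(arcs, n_elements):
    """Count nodes and arcs and verify that no node is left without an arc."""
    linked = {k for ks in arcs.values() for k in ks}
    every_unit_linked = all(len(ks) > 0 for ks in arcs.values())
    every_element_linked = linked == set(range(1, n_elements + 1))
    n_arcs = sum(len(ks) for ks in arcs.values())
    return {
        "frame_units": len(arcs),
        "elements": len(linked),
        "arcs": n_arcs,
        "valid": every_unit_linked and every_element_linked,
    }


summary = check_frame(ARCS, n_elements=6)
```

With this encoding the counts agree with the text: 7 frame units, 6 population elements and 10 arcs, with every node touched by at least one arc.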

Let us fix some notation. We find it convenient to
identify both frame units and population elements with their
respective indices. Let $F = \{1, 2, \ldots, N\}$ denote the set of
indices for $N$ sampling units, and let $T = \{1, 2, \ldots, M\}$
denote the set of indices for the $M$ target population
elements. An arc can be represented as an ordered pair; the
first element of which comes from $F$, and the second from
$T$. A population element $k$ in $T$ is said to be represented by
sampling unit $j$ in $F$, if it is linked to it by an arc denoted
$(jk)$. This means that when $j$ is in the sample there is a
nonzero probability of collecting data from population
element $k$. We will denote by $y_k$ the measurement of
interest on target population element $k$ in $T$.

We now describe the sampling methodology under 
which the estimators developed herein are appropriate. 
Assume a SRSWOR of size n frame units is selected from 
F. The number of population elements included in the 
sample and measured, however, depends upon the nature of
the association between the frame units and the population
elements.

Under SRSWOR, one of four scenarios can occur when 
a frame unit is selected. In the first scenario, a frame unit 
corresponds to one and only one population element (a 
one-to-one structure). Here the surveyor would simply 
collect the information concerning the single population 
element corresponding to the selected frame unit (see frame 
unit 1 of Figure 2.1). 


[Figure 2.1: diagram with columns Sampling Frame, Target Population and Population Element Value]

Figure 2.1. An example of the correspondence between the
sampling frame and the target population


In the second scenario, several frame units correspond to 
one population element (a many-to-one structure). For 
example, in Figure 2.1, frame units 2 and 3 correspond to 
the single population element 2. In this case, if frame 
units 2 and/or 3 are included in the sample, information on 
population element 2 is collected. Thus, it is possible that 
population element 2 could appear in the sample, and as a 
record in the data set used to develop the estimates, up to 
two times. 

In the third scenario, one frame unit corresponds to more 
than one population element (a one-to-many structure). For 
example, in Figure 2.1 frame unit 4 corresponds to 
population elements 3 and 4. Here, only one population 
element (3 or 4) is selected using a randomization indepen- 
dent of the choice of frame units. Economics dictated this 
policy because data collection entailed lengthy personal 
interviews conducted by individuals with technical back- 
grounds. In this paper we assume that these randomizations 
are conducted using equal probabilities. But, any probabili- 
ties could be used (e.g., probability proportional to size) 
provided they are non-zero.

In the fourth scenario, a many-to-many structure exists.
This is illustrated by frame units 5, 6 and 7 and population 
elements 5 and 6 in Figure 2.1. Since these complex cases 
are combinations of scenarios 2 and 3 above, the same 
sampling rules apply. For example, if frame unit 5 is 
selected, population element 5 is measured. If frame unit 6 
is selected, only one of population elements 5 and 6 is 
randomly selected and measured. 


2.1 Population Totals 


2.1.1 Estimator for a Population Total 


A many-to-many frame results in varying probabilities of 
selection. The estimators developed here involve a method of 
weighting, which is an extension of the estimator presented 
by Hansen et al. (1953a pp. 62-64). Their estimators and 
formulas for the variance of those estimators are restricted to 
the many-to-one frame structure. We extend those estimators 
to the many-to-many frame structure. 

For a SRSWOR of size $n$, let $J_1, \ldots, J_n$ denote random
variables such that $J_i = j$ if the $i$-th draw results in the
selection of unit $j$ from $F$. Hence $\Pr(J_i = j) = 1/N$ for $j$ in
$F$ and $i = 1, \ldots, n$. Let $K_1, \ldots, K_n$ denote random variables
such that $K_i = k$ if the $i$-th draw from $F$ is followed by the
selection of $k$ from $T$. We can now think of drawing a
random sample of arcs $\{(J_1, K_1), \ldots, (J_n, K_n)\}$ which has a
joint probability distribution determined by both the
SRSWOR sampling design and the subsequent randomization
(if required) to choose an element in $T$. In particular,
$(J_i, K_i)$ has marginal probability given by $\Pr\{(J_i, K_i) = (jk)\} = (1/N) s_{jk}$,
in which $s_{jk}$ is the conditional probability
given by $s_{jk} = \Pr(K_i = k \mid J_i = j)$. That is, $s_{jk}$ is the
conditional probability of selecting population element $k$ in $T$
given that frame unit $j$ in $F$ is selected. These conditional
probabilities will be referred to as arc probabilities and are
illustrated for Figure 2.1 in Table 2.1.


Table 2.1
Arc Probabilities for Figure 2.1

For $k$ in $T$, let $U_k$ denote the set of units in $F$ that have
arcs with a destination at $k$ in $T$. Let $s_k = \sum_{j \in U_k} s_{jk}$. Using
the language in Hansen et al. (1953a pp. 62-64) which
motivated our development, we call $s_k$ the weight for
population element $k$ in $T$. These weights for Figure 2.1
appear in Table 2.2.


Table 2.2
Calculation of the Population Element Weights ($s_k$) for Figure 2.1


$k$      1    2    3     4     5    6
$s_k$    1    2    1/2   1/2   2    1
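A minimal sketch of how the arc probabilities and the element weights can be computed. It assumes equal-probability randomization within a frame unit, as stated in the text, and an arc list for Figure 2.1 that is inferred from the description in this section. The names `arc_probabilities` and `element_weights` are illustrative.

```python
from fractions import Fraction

# Hypothetical encoding of Figure 2.1: frame unit -> linked population elements.
ARCS = {1: [1], 2: [2], 3: [2], 4: [3, 4], 5: [5], 6: [5, 6], 7: [5, 6]}


def arc_probabilities(arcs):
    """s_jk = 1 / (number of arcs leaving frame unit j)."""
    return {(j, k): Fraction(1, len(ks)) for j, ks in arcs.items() for k in ks}


def element_weights(s_jk):
    """s_k = sum of the arc probabilities of all arcs ending at element k."""
    weights = {}
    for (_, k), prob in s_jk.items():
        weights[k] = weights.get(k, Fraction(0)) + prob
    return weights


s_jk = arc_probabilities(ARCS)
s_k = element_weights(s_jk)
```

Under these assumptions the weights come out as 1, 2, 1/2, 1/2, 2 and 1 for elements 1 to 6, and they add up to the number of frame units.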


Arc probabilities and weights are used to compute the
marginal probabilities of the $K_i$, namely, $\Pr(K_i = k) =
\sum_{j \in U_k} (1/N) s_{jk} = (1/N) s_k$, where $k$ is in $T$, and $i = 1, \ldots, n$.
Clearly, computing the arc probabilities is the key step in
developing the correct weights for the data collected. It
depends on properly ascertaining the graph structure for
each sampling unit selected: a maximally connected (MC)
subgraph. A connected subgraph is a subset of the nodes
which are connected by a sequence of arcs. Maximal
means that no node outside the subset is connected to a
node belonging to the subset. There are 4 MC subgraphs in
Figure 2.1. Each represents a different frame-population
structure, namely, one-to-one, many-to-one, one-to-many,
and many-to-many structure.
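The MC subgraphs are the connected components of the bipartite frame-population graph, so they can be found with an ordinary graph traversal. The sketch below assumes the same inferred arc list for Figure 2.1 as a hypothetical input; `mc_subgraphs` and `structure` are illustrative names.

```python
# Hypothetical encoding of Figure 2.1: frame unit -> linked population elements.
ARCS = {1: [1], 2: [2], 3: [2], 4: [3, 4], 5: [5], 6: [5, 6], 7: [5, 6]}


def mc_subgraphs(arcs):
    """Return the maximally connected subgraphs as sets of tagged nodes."""
    adjacency = {}
    for j, ks in arcs.items():
        for k in ks:
            adjacency.setdefault(("F", j), set()).add(("T", k))
            adjacency.setdefault(("T", k), set()).add(("F", j))
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def structure(component):
    """Label a component by its number of frame units and elements."""
    n_frame = sum(1 for kind, _ in component if kind == "F")
    n_target = len(component) - n_frame
    left = "one" if n_frame == 1 else "many"
    right = "one" if n_target == 1 else "many"
    return left + "-to-" + right


components = mc_subgraphs(ARCS)
labels = [structure(c) for c in components]
```

Only the components that contain a sampled frame unit need to be resolved in the field, which is why the full graph never has to be known.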

To develop the estimators it is not necessary to know the 
structure for the entire graph. It is only necessary to know 
the structure of the MC subgraphs to which sampled frame 
units belong. 

We make the following observations about $s_k$ and
$s_{jk}$: (i) $s_k = W$ indicates that population element $k$ has $W$
times the probability of being selected on the $i$-th draw as
that of a population element with a weight of one;
(ii) $0 < s_k \le N$, $k = 1, \ldots, M$; (iii) $0 < s_{jk} \le 1$, $j \in U_k$ and
$k = 1, \ldots, M$; (iv) with respect to the one-to-many frame
structure, $s_{jk} = s_k$; (v) with respect to the many-to-one frame
structure, $s_{jk} = 1$ for all $k$; and (vi) $\sum_{k=1}^{M} s_k = N$.

Now, let $x_1, \ldots, x_M$ denote the weighted values associated
with the indices in $T$. That is, let $x_k = y_k/s_k$. Define random
variables $x_{K_1}, \ldots, x_{K_n}$, associated with draws 1 through $n$
from $F$, respectively, so that $x_{K_i}$ takes the value $x_k$ if
$K_i = k$. Notice that we can write,

$$E(x_{K_i}) = \sum_{k=1}^{M} x_k \Pr(K_i = k) = \frac{1}{N}\sum_{k=1}^{M}\frac{y_k}{s_k}\,s_k = \frac{Y}{N}, \qquad (2.1)$$

where $Y = \sum_{k=1}^{M} y_k$ is the true population total. We take as
our estimator of the population total based upon a
SRSWOR from a sampling frame with many-to-many
structure,

$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n} x_{K_i}. \qquad (2.2)$$

Using (2.1) it follows that,

$$E(\hat{Y}) = E\left(\frac{N}{n}\sum_{i=1}^{n} x_{K_i}\right) = \frac{N}{n}\sum_{i=1}^{n} E(x_{K_i}) = \frac{N}{n}\,n\,\frac{Y}{N} = Y.$$

We thus obtain,

Theorem 2-1: The estimator (2.2) for a population total
used in SRSWOR is unbiased.
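Unbiasedness can be checked exactly on a small frame by averaging the estimator over every possible SRSWOR sample and over the within-unit randomization. The sketch below does this with exact fractions. The arc list is inferred from the description of Figure 2.1, and the element values in `Y_VALUES` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical encoding of Figure 2.1 and made-up element values y_k.
ARCS = {1: [1], 2: [2], 3: [2], 4: [3, 4], 5: [5], 6: [5, 6], 7: [5, 6]}
Y_VALUES = {1: 10, 2: 40, 3: 5, 4: 8, 5: 30, 6: 12}


def expected_estimate(arcs, y, n):
    """Exact expectation of (N/n) * sum of y_k/s_k over SRSWOR samples of size n."""
    s_jk = {(j, k): Fraction(1, len(ks)) for j, ks in arcs.items() for k in ks}
    s_k = {}
    for (_, k), prob in s_jk.items():
        s_k[k] = s_k.get(k, Fraction(0)) + prob
    big_n = len(arcs)
    # Expected weighted value contributed by frame unit j, averaged over
    # the equal-probability choice of one of its linked elements.
    unit_mean = {
        j: sum(s_jk[(j, k)] * Fraction(y[k]) / s_k[k] for k in ks)
        for j, ks in arcs.items()
    }
    samples = list(combinations(sorted(arcs), n))
    total = sum(
        Fraction(big_n, n) * sum(unit_mean[j] for j in sample) for sample in samples
    )
    return total / len(samples)


true_total = sum(Y_VALUES.values())
expectations = {n: expected_estimate(ARCS, Y_VALUES, n) for n in (1, 2, 4, 7)}
```

For every sample size tried, the exact expectation equals the true total of the hypothetical values.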

Using Figure 2.1, we now give a simple example of the
use of this estimator. Suppose a simple random sample of
four frame units (2, 3, 4, and 7) was selected from the frame
depicted in Figure 2.1, which ultimately resulted in the
selection of population elements 2, 4, and 5. With $N = 7$,
$n = 4$ and the weights in Table 2.2, the estimator
of the population total is

$$\hat{Y} = \frac{7}{4}\left(\frac{y_2}{2} + \frac{y_2}{2} + \frac{y_4}{1/2} + \frac{y_5}{2}\right),$$

where element 2 enters twice because both of its frame
units were drawn.

The above estimator can also be used for a population
count. We could estimate the size of the target population
by letting $y_k = 1$ for all $k$. In addition, we could estimate
the number of population elements that possess some
characteristic by letting $y_k = 1$ for those population
elements with the characteristic of interest and $y_k = 0$ for
those without the characteristic.
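The example can be worked through numerically as follows. The weights are those implied by the arcs described for Figure 2.1; the element values in `Y_VALUES` are hypothetical, and `estimate_total` is an illustrative name. Element 2 appears twice in the record list because both of its frame units (2 and 3) were drawn.

```python
from fractions import Fraction

N_FRAME = 7  # number of frame units in Figure 2.1
S_K = {
    1: Fraction(1), 2: Fraction(2), 3: Fraction(1, 2),
    4: Fraction(1, 2), 5: Fraction(2), 6: Fraction(1),
}
Y_VALUES = {1: 10, 2: 40, 3: 5, 4: 8, 5: 30, 6: 12}  # hypothetical values


def estimate_total(records, y, s_k=S_K, n_frame=N_FRAME):
    """(N/n) times the sum of y_k/s_k over the records, one record per draw."""
    n = len(records)
    return Fraction(n_frame, n) * sum(Fraction(y[k]) / s_k[k] for k in records)


records = [2, 2, 4, 5]  # elements reached from frame units 2, 3, 4 and 7
total_estimate = estimate_total(records, Y_VALUES)
count_estimate = estimate_total(records, {k: 1 for k in S_K})  # y_k = 1 for all k
```

Setting every value to one turns the same function into an estimator of the population count.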


2.1.2 Variance of the Estimator for a Population 
Total 


First, some additional terminology and notation used in
this section must be defined. Let $P$ represent the set of all
unordered pairs of arcs. We shall define an unordered pair
of arcs as inadmissible if they cannot both be included in a
sample. Formally let $Q = \{j$ in $F$: more than one arc
emerges from $j\}$. Then $R' = \{[jk, jk'] : j \in Q$ and $k \ne k'\}$
is the set of unordered inadmissible pairs of arcs. Also, the
set of unordered admissible pairs of arcs is the
complementary set $R^* = P \setminus R'$.

To illustrate, consider Figure 2.1. The sampling metho- 
dology we employ requires that if frame unit 4 is selected, 
only one of population elements 3 and 4 can be included in 
the sample. Thus, {[4,3][4,4]} is an unordered inadmissible 
pair of arcs. The other unordered inadmissible pairs of arcs 
in Figure 2.1 are {[6,5][6,6]} and {[7,5][7,6]}. Thus, R’ = 
{[4,3][4,4], [6,5][6,6], [7,5][7,6]}. 
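Splitting the pairs of arcs into admissible and inadmissible sets is mechanical: two arcs are inadmissible exactly when they leave the same frame unit. A short sketch, again using an arc list inferred from the description of Figure 2.1 (the name `split_pairs` is illustrative):

```python
from itertools import combinations

# Hypothetical encoding of Figure 2.1: frame unit -> linked population elements.
ARCS = {1: [1], 2: [2], 3: [2], 4: [3, 4], 5: [5], 6: [5, 6], 7: [5, 6]}


def split_pairs(arcs):
    """Return (admissible, inadmissible) lists of unordered pairs of arcs."""
    arc_list = [(j, k) for j, ks in sorted(arcs.items()) for k in ks]
    admissible, inadmissible = [], []
    for first, second in combinations(arc_list, 2):
        if first[0] == second[0]:  # same frame unit: cannot both be sampled
            inadmissible.append((first, second))
        else:
            admissible.append((first, second))
    return admissible, inadmissible


admissible, inadmissible = split_pairs(ARCS)
```

For this frame there are 45 unordered pairs in all, of which 3 are inadmissible.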

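Theorem 2-2: The variance of the estimator (2.2) is,

$$V(\hat{Y}) = \frac{N}{n}\sum_{k=1}^{M}\frac{y_k^2}{s_k} + \frac{2N(n-1)}{n(N-1)}\sum\sum_{R^*}\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{j'k'}}{s_{k'}} - Y^2, \qquad (2.3)$$

where the double sum is over all unordered admissible pairs
of arcs $[jk, j'k']$.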


Proof:

$$V(\hat{Y}) = E\left[\frac{N}{n}\sum_{i=1}^{n} x_{K_i}\right]^2 - Y^2 = \frac{N^2}{n^2}\,E\left[\sum_{i=1}^{n} x_{K_i}\right]^2 - Y^2. \qquad (2.4)$$

One can write

$$E\left[\sum_{i=1}^{n} x_{K_i}\right]^2 = \sum_{i=1}^{n} E\left(x_{K_i}^2\right) + 2\sum\sum_{i<i'} E\left(x_{K_i} x_{K_{i'}}\right), \qquad (2.5)$$

and, as in (2.1),

$$E\left(x_{K_i}^2\right) = \sum_{k=1}^{M} x_k^2 \Pr(K_i = k) = \frac{1}{N}\sum_{k=1}^{M}\frac{y_k^2}{s_k}. \qquad (2.6)$$

As mentioned in Section 2.1, we can think of selecting
a sample of arcs which ultimately leads to the selection of
population elements. Each arc $(jk)$ is associated with a
value $x_k = y_k/s_k$ of the population element $k$ at its
destination. Thus, we can rewrite the double summation in (2.5)
as a summation over admissible unordered pairs of arcs,
$R^*$:

$$E\left(x_{K_i} x_{K_{i'}}\right) = \sum\sum_{R^*} x_k x_{k'} \Pr(\text{select } [jk, j'k'] \text{ in } R^*). \qquad (2.7)$$

Now, by virtue of the independence of the randomization
and the choice of frame units:

$$\Pr(\text{select } [jk, j'k'] \text{ in } R^*) = \Pr(\text{select } \{j, j'\} \text{ in } F)\,\Pr(\text{select } [jk, j'k'] \text{ in } R^* \mid \text{select } \{j, j'\} \text{ in } F) = \binom{N}{2}^{-1} s_{jk}\, s_{j'k'}.$$

Substituting into (2.7) results in,

$$E\left(x_{K_i} x_{K_{i'}}\right) = \binom{N}{2}^{-1}\sum\sum_{R^*}\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{j'k'}}{s_{k'}}. \qquad (2.8)$$

Now substituting (2.6) and (2.8) into (2.5) yields,

$$E\left[\sum_{i=1}^{n} x_{K_i}\right]^2 = \frac{n}{N}\sum_{k=1}^{M}\frac{y_k^2}{s_k} + \frac{2n(n-1)}{N(N-1)}\sum\sum_{R^*}\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{j'k'}}{s_{k'}}. \qquad (2.9)$$

Finally substituting (2.9) into (2.4) gives the result (2.3).

Equation (2.3) generalizes, to the many-to-many frame
structure, the formula developed by Bandyopadhyay and
Adhikari (1993) for the variance of the estimate of a
population total. It can be shown that
(2.3) reduces to their formula when the sampling frame is
restricted to a many-to-one structure.
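Corollary 2-1: An alternative form of the variance formula
in Theorem 2-2 is:

$$V(\hat{Y}) = \frac{N}{n}\sum_{k=1}^{M}\frac{y_k^2}{s_k} + \frac{N(n-1)}{n(N-1)}\left[Y^2 - \sum_{jk}\left(\frac{y_k s_{jk}}{s_k}\right)^2 - 2\sum\sum_{R'}\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{jk'}}{s_{k'}}\right] - Y^2,$$

where the single sum is over all arcs.

Proof: Because the terms $y_k s_{jk}/s_k$ sum to $Y$ over all arcs,
squaring that sum and separating the pairs of arcs in $P$ into
$R^*$ and $R'$ gives

$$\sum\sum_{R^*}\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{j'k'}}{s_{k'}} = \frac{1}{2}\left[Y^2 - \sum_{jk}\left(\frac{y_k s_{jk}}{s_k}\right)^2\right] - \sum\sum_{R'}\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{jk'}}{s_{k'}}.$$

Substituting the above expression into (2.3) provides the
result.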

This formula is computationally simpler. Note that (2.3) 
requires that the term 


$$\frac{y_k s_{jk}}{s_k}\,\frac{y_{k'} s_{j'k'}}{s_{k'}}$$


be summed over all unordered admissible pairs of arcs 
(R*), whereas this alternative formula only requires a 
summation over pairs of arcs that are inadmissible (R’). In 
most practical scenarios the number of admissible pairs of 
arcs will be far greater than the number of inadmissible 
pairs of arcs. 
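Variance formulas of this kind can be sanity-checked on a small frame by brute force. The sketch below enumerates every SRSWOR sample and every outcome of the within-unit randomization, computes the exact variance of the estimator (2.2), and compares it with two closed forms: one summing over admissible pairs and one summing over inadmissible pairs. Both closed forms are assumptions obtained by plugging the expectation of the squared sum into $V(\hat{Y}) = E(\hat{Y}^2) - Y^2$ under equal-probability randomization. The arc list and the element values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

ARCS = {1: [1], 2: [2], 3: [2], 4: [3, 4], 5: [5], 6: [5, 6], 7: [5, 6]}
Y_VALUES = {1: 10, 2: 40, 3: 5, 4: 8, 5: 30, 6: 12}  # hypothetical values


def weights(arcs):
    """Arc probabilities s_jk (equal within a unit) and element weights s_k."""
    s_jk = {(j, k): Fraction(1, len(ks)) for j, ks in arcs.items() for k in ks}
    s_k = {}
    for (_, k), prob in s_jk.items():
        s_k[k] = s_k.get(k, Fraction(0)) + prob
    return s_jk, s_k


def brute_force_variance(arcs, y, n):
    """Exact variance by enumerating samples and randomization outcomes."""
    s_jk, s_k = weights(arcs)
    big_n = len(arcs)
    samples = list(combinations(sorted(arcs), n))
    first = second = Fraction(0)
    for sample in samples:
        for chosen in product(*(arcs[j] for j in sample)):
            prob = Fraction(1, len(samples))
            for j, k in zip(sample, chosen):
                prob *= s_jk[(j, k)]
            est = Fraction(big_n, n) * sum(Fraction(y[k]) / s_k[k] for k in chosen)
            first += prob * est
            second += prob * est ** 2
    return second - first ** 2


def closed_forms(arcs, y, n):
    """Return (sum over admissible pairs, sum over inadmissible pairs) forms."""
    s_jk, s_k = weights(arcs)
    big_n = len(arcs)
    a = {arc: Fraction(y[arc[1]]) * p / s_k[arc[1]] for arc, p in s_jk.items()}
    pairs = list(combinations(sorted(a), 2))
    admissible = sum(a[p] * a[q] for p, q in pairs if p[0] != q[0])
    inadmissible = sum(a[p] * a[q] for p, q in pairs if p[0] == q[0])
    squares = Fraction(big_n, n) * sum(Fraction(y[k]) ** 2 / s_k[k] for k in y)
    total = sum(y.values())
    factor = Fraction(big_n * (n - 1), n * (big_n - 1))
    via_admissible = squares + 2 * factor * admissible - total ** 2
    arc_squares = sum(v ** 2 for v in a.values())
    via_inadmissible = (
        squares + factor * (total ** 2 - arc_squares - 2 * inadmissible) - total ** 2
    )
    return via_admissible, via_inadmissible


checks = {
    n: (brute_force_variance(ARCS, Y_VALUES, n), closed_forms(ARCS, Y_VALUES, n))
    for n in (1, 2, 3, 4)
}
```

On this frame the inadmissible-pair form loops over 3 pairs instead of 42, which is the computational saving the text refers to.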


2.2 Population Means 
2.2.1 Estimator for a Population Mean 


The estimator for a population mean presented here 
extends the estimator presented by Hansen et al. (1953a) to 
the many-to-many frame structure. 

Associated with the $n$ draws from $F$, define random
variables $s_{K_i}$ and $z_{K_i} = 1/s_{K_i}$, so that $s_{K_i}$ takes value $s_k$ if
$K_i = k$ for $i = 1, \ldots, n$ and $k = 1, \ldots, M$. The estimator for a
population mean is

$$\hat{\bar{Y}} = \frac{\sum_{i=1}^{n} x_{K_i}}{\sum_{i=1}^{n} z_{K_i}}. \qquad (2.10)$$


2.2.2 Mean Square Error (MSE) of the Estimator 
for a Population Mean 


The estimator for a population mean is biased because it 
is a ratio estimator. But, it is well known that this bias 
becomes negligible for large samples and the bias is of 
order 1/n (Cochran 1977, p. 160). 
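As a small illustration of a ratio-type mean estimator, the sketch below divides the weighted sum of values by the weighted count of records, so that the factor N/n cancels. Treating the mean estimator as this particular ratio is an assumption made for the sketch; the weights are those implied by the arcs described for Figure 2.1 and the element values are hypothetical.

```python
from fractions import Fraction

S_K = {
    1: Fraction(1), 2: Fraction(2), 3: Fraction(1, 2),
    4: Fraction(1, 2), 5: Fraction(2), 6: Fraction(1),
}
Y_VALUES = {1: 10, 2: 40, 3: 5, 4: 8, 5: 30, 6: 12}  # hypothetical values


def estimate_mean(records, y, s_k=S_K):
    """Ratio of sum(y_k / s_k) to sum(1 / s_k) over the sampled records."""
    numerator = sum(Fraction(y[k]) / s_k[k] for k in records)
    denominator = sum(1 / s_k[k] for k in records)
    return numerator / denominator


records = [2, 2, 4, 5]  # one record per draw; element 2 was reached twice
mean_estimate = estimate_mean(records, Y_VALUES)
true_mean = Fraction(sum(Y_VALUES.values()), len(Y_VALUES))
```

A single sample's estimate will generally differ from the true mean; the bias discussed in the text concerns the average of such estimates over all possible samples.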

Our approximation of the MSE requires a summation
over $R^{**}$, the set of all ordered admissible pairs of arcs.
Thus, if $[jk, j'k'] \in R^*$, then both $[jk, j'k'] \in R^{**}$ and
$[j'k', jk] \in R^{**}$.

To approximate the mean square error of the estimator
(2.10), we use

Because $\hat{\bar{Y}}$ is a ratio of two estimates, the well known
approximation for the mean square error (Cochran 1977,
pp. 32-33) can be used:

The first expectation in (2.12) is simply (2.9). Next,
using (2.1) on the middle term in (2.12) results in

Using (2.7) and (2.9) yields,

Note that the double sum is over all admissible ordered
pairs of arcs. Therefore,

Finally, similar to (2.1),

Substituting these expectations into equation (2.12) yields
(2.11).


3. ESTIMATORS FOR MANY-TO-MANY 
FRAMES UNDER STRATIFIED 
RANDOM SAMPLING 


3.1 Introduction 


In this section we develop the estimators for a population 
count, mean, and total in the many-to-many frame case, 
when stratified random sampling is used. First, however, it 
is necessary to describe the sampling methodology under 
which these estimates are appropriate. Figure 3.1 provides 
an example that will be used throughout this section. 


3.2 The Sampling Methodology 


The same scenarios that were described in SRSWOR 
occur with respect to stratified random sampling. However, 
there are some additional problems that can arise in this 
case. 

Consider the building characteristics study that moti- 
vated this research. Assume that the population element 
value in Figure 3.1 is the building size, and the stratification 
variable is electrical usage associated with the street 
address. Because the frame of street addresses had a 
many-to-many correspondence with the target population of 
commercial buildings, the following problems arose in 
addition to those mentioned in Section 2.1: 

1. Mis-stratification: For example, frame unit (street
address) 2 in stratum 1 appeared to be a large building
because of the large electrical usage associated with it, 
and as a result, it was placed in the first stratum. The 
data collection revealed that the street address actually 
corresponded to two small buildings (population 
elements 2 and 3). In another example, frame units 5 
and 6 in stratum 2 appeared to be two small buildings in 
the frame, and were placed in the second stratum. But, 
the corresponding population element 7 is one large 
building with two street addresses. 

2. Crossover: For example, frame units 3 and 4 in stratum 1,
and frame units 1 and 2 in stratum 2 each have a
different street address and, as a result, appear in the
frame to be two small and two large buildings. But, data
collection revealed that all four street addresses
corresponded to only one building (e.g., a strip mall). In
this case, not only is mis-stratification a problem, but not
all the frame units associated with a single building are
included in the same strata. That is, one population
element (i.e., building) “crosses over” multiple strata.

In the next section we develop estimators for population
totals and counts and show that these estimators are unbiased
despite mis-stratification and crossover. As is usually the
case, however, mis-stratification increases the variance of the
estimates. Also, insofar as crossover induces
mis-stratification, it too increases the variances of the estimates.
[Figure 3.1 diagram. Column headings: Stratification Variable Value, Sampling Frame, Target Population, Population Element Value. Row groups: Stratum 1, Stratum 2.] 


Note: Frame units were placed in stratum 1 if the value of the stratification variable was 20 or more. 
Otherwise, the frame units were placed in stratum 2. 


Figure 3.1. An example of the correspondence between the frame and the target population in stratified random sampling 


Table 3.1 
Arc Probabilities for Figure 3.1 


Arc hjk    1,1,1  1,2,2  1,2,3  1,3,4  1,4,4  2,1,4  2,2,4  2,3,5  2,4,5  2,4,6  2,5,7  2,6,7 
s_{hjk}      1     1/2    1/2     1      1      1      1      1     1/2    1/2     1      1 


3.3 Population Totals and Counts 
3.3.1 Estimator for a Population Total 


The estimator developed here involves a method of 
weighting which extends the estimator presented in 
Hansen et al. (1953a, pp. 62-64) to stratified random 
sampling when using a many-to-many frame. 

Assume that $F$ has been partitioned into $L$ mutually 
exclusive and exhaustive strata $F_1, \ldots, F_L$ of size $N_1, \ldots, N_L$, 
respectively. Units in $F_h$ will be denoted $hj$ where 
$j = 1, \ldots, N_h$ and $h = 1, \ldots, L$. Also, assume that a stratified 
random sample (without replacement) of size $n = n_1 + \ldots + n_L$ 
has been drawn, where $n_h$ is the sample size from 
$F_h$. Let $hJ_1, \ldots, hJ_{n_h}$ denote random variables such that 
$hJ_i = hj$ if the $i$-th draw from $F_h$ results in the selection of $hj$. 
Let $hK_1, \ldots, hK_{n_h}$ denote random variables such that 
$hK_i = k$ if the $i$-th draw from $F_h$ is followed by the 
selection of $k$ from $T$. If $hjk$ denotes the arc that originates 
at frame unit $hj$ in $F_h$ and terminates at $k$ in $T$, the marginal 
probability of the random arc $(hJ_i, hK_i)$ is given by 

$$\Pr\{(hJ_i, hK_i) = (hj, k)\} = \frac{s_{hjk}}{N_h},$$ 

in which $s_{hjk} = \Pr(hK_i = k \mid hJ_i = hj)$ is an arc probability. 
Note that $s_{hjk}$ is the conditional probability of selecting 
population element $k$ in $T$ given that frame unit $hj$ has been 
chosen. Assuming equal randomization probabilities, 
Table 3.1 shows the arc probabilities for Figure 3.1. 

Let $W_k$ denote the set of frame units $hj$ in $F$ that have 
arcs with a destination at $k$ in $T$. For example, $W_4 = \{(1, 3), 
(1, 4), (2, 1), (2, 2)\}$. Also, define the population element 
weight $s_k = \sum_{hj \in W_k} s_{hjk}$. 

Table 3.2 contains the weights ($s_k$) for all the population 
elements in Figure 3.1. The same observations concerning 
arc weights ($s_{hjk}$) and population element weights ($s_k$) 
made in section 2.3.1 apply here. 


Table 3.2 
Population Element Weights (s,) for Figure 3.1 
k      1    2     3     4            5            6     7 
s_k    1    1/2   1/2   1+1+1+1=4    1+1/2=3/2    1/2   1+1=2 
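
The weights in Table 3.2 can be reproduced mechanically from the arc probabilities. The following is a minimal sketch, assuming the arcs of Figure 3.1 are exactly those listed in Table 3.1; the names `ARCS`, `element_weights` and `frame_units_for` are illustrative and not part of the paper.

```python
from fractions import Fraction as F

# Arc probabilities s_hjk, keyed by (stratum h, frame unit j, population element k).
# The entries are read from Table 3.1.
ARCS = {
    (1, 1, 1): F(1), (1, 2, 2): F(1, 2), (1, 2, 3): F(1, 2),
    (1, 3, 4): F(1), (1, 4, 4): F(1),
    (2, 1, 4): F(1), (2, 2, 4): F(1), (2, 3, 5): F(1),
    (2, 4, 5): F(1, 2), (2, 4, 6): F(1, 2),
    (2, 5, 7): F(1), (2, 6, 7): F(1),
}


def element_weights(arcs):
    """Return s_k, the sum of arc probabilities over all arcs ending at k."""
    s = {}
    for (_, _, k), p in arcs.items():
        s[k] = s.get(k, F(0)) + p
    return s


def frame_units_for(arcs, k):
    """Return W_k, the frame units (h, j) that have an arc ending at element k."""
    return sorted({(h, j) for (h, j, kk) in arcs if kk == k})


s = element_weights(ARCS)
print({k: str(v) for k, v in sorted(s.items())})
print(frame_units_for(ARCS, 4))
```

Exact fractions are used so that the weights can be compared with the table without rounding.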
For each $h = 1, \ldots, L$ and $i = 1, \ldots, n_h$, let $x_{hK_i}$ be random 
variables such that $x_{hK_i} = y_k / s_k$ if $k$ in $T$ is selected as a 
result of the selection of some $hj$ in $F_h$. 

The estimator of a population total for stratified random 
sampling, when using a sampling frame with a many-to-many 
structure is: 

$$\hat{Y}_{st} = \sum_{h=1}^{L} \hat{Y}_h, \quad \text{where} \quad \hat{Y}_h = \frac{N_h}{n_h} \sum_{i=1}^{n_h} x_{hK_i}. \qquad (3.1)$$ 


3.3.2 Variance of the Estimator for a Population 
Total 


Prior to developing the variance of estimator (3.1), some 
additional terminology must be defined. Let $q_{hk}$ denote the 
“stratum element weight”. This additional weight is 
necessary because of the potential of crossover. Let $U_{hk}$ 
denote the set of frame units in $F_h$ that have arcs with a 
destination at population element $k$. For example, $U_{24} = 
\{(2, 1), (2, 2)\}$. Then define $q_{hk} = \sum_{hj \in U_{hk}} s_{hjk}$. To illustrate, 
recall in Figure 3.1 population element 4 is “represented” by 
two frame units in stratum 2, so $q_{24} = \sum_{hj \in U_{24}} s_{hj4} = 2$. 

The weight $q_{hk}$ plays the role of $s_k$ when selection is 
restricted to $F_h$. In fact, $q_{hk} = s_k$ when there is no crossover. 
The probability of selecting any frame unit from $F_h$ 
on step $i$ out of $n_h$ is $1/N_h$. But, the probability of selecting 
a population element $k$ represented by a frame unit in $F_h$ is 
$\Pr(hK_i = k) = q_{hk}/N_h$ for all $i = 1, \ldots, n_h$. 

In order to develop the proof in this section, we 
introduce the term “apportioned stratum total”, denoted by $Y_h$. 

In effect, the values of the population elements that are 
represented by frame units in multiple strata are apportioned 
among those strata. Let $V_h$ denote the set of population 
elements associated with frame units in $F_h$. In our example 
$V_1 = \{1, 2, 3, 4\}$ and $V_2 = \{4, 5, 6, 7\}$. Let 

$$Y_h = \sum_{k \in V_h} \frac{y_k q_{hk}}{s_k},$$ 

where $y_k$ is the value of population element $k$ in $T$. 
When crossover is present, use of the 
weights $q_{hk}$ and $s_k$ apportions the measure $y_k$ among the 
strata in which population element $k$ is represented. We can 
think of the use of these weights as distributing the 
population element value among the strata depending upon 
the number of times the population element is represented 
in a stratum relative to the total number of times it is 
represented in the frame. For example in Figure 3.1 $Y_1$ 
and $Y_2$ are calculated as follows: 

$$Y_1 = \frac{30(1)}{1} + \frac{15(1/2)}{1/2} + \frac{5(1/2)}{1/2} + \frac{65(2)}{4} = 82.5$$ 

$$Y_2 = \frac{65(2)}{4} + \frac{y_5(3/2)}{3/2} + \frac{y_6(1/2)}{1/2} + \frac{y_7(2)}{2} = 32.5 + y_5 + y_6 + y_7$$ 

Note that $\sum_h Y_h = Y$ whether or not crossover exists. 
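
The apportionment can be checked numerically. The sketch below assumes the arcs of Table 3.1 and uses the element values 30, 15, 5 and 65 for elements 1 to 4 from the worked example; the values given to elements 5, 6 and 7 are hypothetical placeholders, as are all function names.

```python
from fractions import Fraction as F

# Arc probabilities s_hjk keyed by (stratum h, frame unit j, element k), from Table 3.1.
ARCS = {
    (1, 1, 1): F(1), (1, 2, 2): F(1, 2), (1, 2, 3): F(1, 2),
    (1, 3, 4): F(1), (1, 4, 4): F(1),
    (2, 1, 4): F(1), (2, 2, 4): F(1), (2, 3, 5): F(1),
    (2, 4, 5): F(1, 2), (2, 4, 6): F(1, 2),
    (2, 5, 7): F(1), (2, 6, 7): F(1),
}
# y_1..y_4 follow the worked example; y_5..y_7 are hypothetical values.
Y = {1: F(30), 2: F(15), 3: F(5), 4: F(65), 5: F(10), 6: F(20), 7: F(40)}


def element_weights(arcs):
    """s_k: sum of arc probabilities over all arcs ending at k."""
    s = {}
    for (_, _, k), p in arcs.items():
        s[k] = s.get(k, F(0)) + p
    return s


def stratum_element_weights(arcs):
    """q_hk: sum of arc probabilities over arcs that start in stratum h and end at k."""
    q = {}
    for (h, _, k), p in arcs.items():
        q[(h, k)] = q.get((h, k), F(0)) + p
    return q


def apportioned_totals(arcs, y):
    """Y_h: sum over k in V_h of y_k * q_hk / s_k."""
    s = element_weights(arcs)
    totals = {}
    for (h, k), q_hk in stratum_element_weights(arcs).items():
        totals[h] = totals.get(h, F(0)) + y[k] * q_hk / s[k]
    return totals


totals = apportioned_totals(ARCS, Y)
print(float(totals[1]))
print(sum(totals.values()) == sum(Y.values()))
```

Element 4 is the only one that crosses strata here, so its value of 65 is split equally between the two apportioned totals.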


Theorem 3-1: The estimator for a population total (3.1) is 
unbiased. 
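
The claim can be checked by brute force on the example of Figure 3.1. The sketch below is an illustration under stated assumptions: the arcs are those of Table 3.1, the values of elements 5 to 7 and the stratum sample sizes are hypothetical, every frame unit has at least one arc, and the function names are not from the paper. It enumerates every without-replacement sample and every arc choice within a stratum and computes the exact expectation.

```python
from fractions import Fraction as F
from itertools import combinations, product

# Arc probabilities s_hjk keyed by (stratum h, frame unit j, element k), from Table 3.1.
ARCS = {
    (1, 1, 1): F(1), (1, 2, 2): F(1, 2), (1, 2, 3): F(1, 2),
    (1, 3, 4): F(1), (1, 4, 4): F(1),
    (2, 1, 4): F(1), (2, 2, 4): F(1), (2, 3, 5): F(1),
    (2, 4, 5): F(1, 2), (2, 4, 6): F(1, 2),
    (2, 5, 7): F(1), (2, 6, 7): F(1),
}
# y_1..y_4 follow the worked example; y_5..y_7 are hypothetical values.
Y = {1: F(30), 2: F(15), 3: F(5), 4: F(65), 5: F(10), 6: F(20), 7: F(40)}
SAMPLE_SIZES = {1: 2, 2: 3}  # hypothetical n_h


def element_weights(arcs):
    """s_k: sum of arc probabilities over all arcs ending at k."""
    s = {}
    for (_, _, k), p in arcs.items():
        s[k] = s.get(k, F(0)) + p
    return s


def expected_stratum_estimate(arcs, y, h, n_h):
    """Exact E(Y_hat_h), enumerating all samples of n_h frame units and all arc choices."""
    s = element_weights(arcs)
    arcs_by_unit = {}
    for (hh, j, k), p in arcs.items():
        if hh == h:
            arcs_by_unit.setdefault(j, []).append((k, p))
    units = sorted(arcs_by_unit)
    big_n = len(units)
    samples = list(combinations(units, n_h))  # each sample is equally likely
    expectation = F(0)
    for sample in samples:
        for choice in product(*(arcs_by_unit[j] for j in sample)):
            prob = F(1, len(samples))
            x_sum = F(0)
            for k, p in choice:
                prob *= p              # probability of following this arc
                x_sum += y[k] / s[k]   # x = y_k / s_k
            expectation += prob * F(big_n, n_h) * x_sum
    return expectation


expected_total = sum(
    expected_stratum_estimate(ARCS, Y, h, n_h) for h, n_h in SAMPLE_SIZES.items()
)
print(expected_total == sum(Y.values()))
```

The expectation of each stratum estimate equals the apportioned stratum total, so the sum over strata recovers the population total even though element 4 crosses strata.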


Proof: 
From (3.1), 

$$E(\hat{Y}_{st}) = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{i=1}^{n_h} E(x_{hK_i}). \qquad (3.2)$$ 

For each $i = 1, \ldots, n_h$, 

$$E(x_{hK_i}) = \sum_{k \in V_h} \frac{y_k}{s_k} \Pr(hK_i = k) = \sum_{k \in V_h} \frac{y_k}{s_k} \frac{q_{hk}}{N_h} = \frac{1}{N_h} \sum_{k \in V_h} \frac{y_k q_{hk}}{s_k} = \frac{Y_h}{N_h}. \qquad (3.3)$$ 

Substituting (3.3) into equation (3.2) yields $E(\hat{Y}_{st}) = \sum_{h=1}^{L} Y_h = Y$. 

In the main result below we need the following notation. 
Let $R_h$ and $R_h'$ be the set of admissible and inadmissible 
unordered pairs of arcs originating in $F_h$, respectively. 
Definitions of the above are identical to the corresponding 
concepts for the SRSWOR case, but restricted now to strata. 
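Theorem 3-2: The variance of (3.1) is: 

$$V(\hat{Y}_{st}) = \sum_{h=1}^{L} V(\hat{Y}_h) \qquad (3.4)$$ 

where 

$$V(\hat{Y}_h) = \frac{N_h}{n_h} \left[ \sum_{k \in V_h} \frac{y_k^2 q_{hk}}{s_k^2} + \frac{2(n_h - 1)}{N_h - 1} \sum_{(hjk,\, hj'k') \in R_h} \left( \frac{y_k s_{hjk}}{s_k} \right) \left( \frac{y_{k'} s_{hj'k'}}{s_{k'}} \right) \right] - \left( \sum_{k \in V_h} \frac{y_k q_{hk}}{s_k} \right)^2. \qquad (3.5)$$ 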


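Proof: First write, 

$$V(\hat{Y}_{st}) = \sum_{h=1}^{L} V(\hat{Y}_h) + 2 \left( \sum_{h<h'} E(\hat{Y}_h \hat{Y}_{h'}) - \sum_{h<h'} Y_h Y_{h'} \right). \qquad (3.6)$$ 

The last two terms cancel because $\hat{Y}_h$ and $\hat{Y}_{h'}$ are 
independent. This follows since apportionment creates a 
new stratified population containing no crossover and 
samples chosen within different strata are independent. 
Thus, 

$$V(\hat{Y}_{st}) = \sum_{h=1}^{L} V(\hat{Y}_h) = \sum_{h=1}^{L} E(\hat{Y}_h^2) - \sum_{h=1}^{L} Y_h^2.$$ 

Now, 

$$E(\hat{Y}_h^2) = \frac{N_h^2}{n_h^2} E\left[ \left( \sum_{i=1}^{n_h} x_{hK_i} \right)^2 \right] = \frac{N_h^2}{n_h^2} \left[ \sum_{i=1}^{n_h} E(x_{hK_i}^2) + 2 E\left( \sum_{i<i'} x_{hK_i} x_{hK_{i'}} \right) \right]. \qquad (3.7)$$ 

For each $i = 1, \ldots, n_h$, 

$$E(x_{hK_i}^2) = \sum_{k \in V_h} \frac{y_k^2}{s_k^2} \Pr(hK_i = k) = \frac{1}{N_h} \sum_{k \in V_h} \frac{y_k^2 q_{hk}}{s_k^2}. \qquad (3.8)$$ 

Then, using equation (2.7) and (2.8), for any $i \neq i'$, 

$$2 E\left( \sum_{i<i'} x_{hK_i} x_{hK_{i'}} \right) = n_h (n_h - 1) E(x_{hK_i} x_{hK_{i'}}) = n_h (n_h - 1) \sum_{k} \sum_{k'} \frac{y_k}{s_k} \frac{y_{k'}}{s_{k'}} \Pr(hK_i = k, hK_{i'} = k')$$ 

$$= \frac{2 n_h (n_h - 1)}{N_h (N_h - 1)} \sum_{(hjk,\, hj'k') \in R_h} \left( \frac{y_k s_{hjk}}{s_k} \right) \left( \frac{y_{k'} s_{hj'k'}}{s_{k'}} \right). \qquad (3.9)$$ 

Equation (3.5) now follows from (3.8), (3.9), and the 
definition of $Y_h$. 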

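Using the method of Corollary 2-1, (3.5) can be 
simplified for computing purposes as follows: 

$$V(\hat{Y}_h) = \frac{N_h}{n_h} \left\{ \sum_{k \in V_h} \frac{y_k^2 q_{hk}}{s_k^2} + \frac{(n_h - 1)}{(N_h - 1)} \left[ \left( \sum_{hjk \in A_h} \frac{y_k s_{hjk}}{s_k} \right)^2 - \sum_{hjk \in A_h} \left( \frac{y_k s_{hjk}}{s_k} \right)^2 - 2 \sum_{(hjk,\, hjk') \in R_h'} \left( \frac{y_k s_{hjk}}{s_k} \right) \left( \frac{y_{k'} s_{hjk'}}{s_{k'}} \right) \right] \right\} - \left( \sum_{k \in V_h} \frac{y_k q_{hk}}{s_k} \right)^2,$$ 

where $A_h$ denotes the set of arcs that originate at frame 
units in $F_h$. 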
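
The agreement between the pairwise form of the stratum variance and its computational form can be checked on the running example. The sketch below makes several assumptions that should be read as such: the arcs are those of Table 3.1, the values of elements 5 to 7 are hypothetical, admissible pairs are taken to be pairs of arcs that start at different frame units (inadmissible pairs start at the same frame unit), and the stratum has more than one frame unit. The function names are illustrative.

```python
from fractions import Fraction as F
from itertools import combinations

# Arc probabilities s_hjk keyed by (stratum h, frame unit j, element k), from Table 3.1.
ARCS = {
    (1, 1, 1): F(1), (1, 2, 2): F(1, 2), (1, 2, 3): F(1, 2),
    (1, 3, 4): F(1), (1, 4, 4): F(1),
    (2, 1, 4): F(1), (2, 2, 4): F(1), (2, 3, 5): F(1),
    (2, 4, 5): F(1, 2), (2, 4, 6): F(1, 2),
    (2, 5, 7): F(1), (2, 6, 7): F(1),
}
# y_1..y_4 follow the worked example; y_5..y_7 are hypothetical values.
Y = {1: F(30), 2: F(15), 3: F(5), 4: F(65), 5: F(10), 6: F(20), 7: F(40)}


def element_weights(arcs):
    """s_k: sum of arc probabilities over all arcs ending at k."""
    s = {}
    for (_, _, k), p in arcs.items():
        s[k] = s.get(k, F(0)) + p
    return s


def stratum_variance(arcs, y, h, n_h, simplified=False):
    """Variance of the stratum estimator, in pairwise or computational form."""
    s = element_weights(arcs)
    # a[(j, k)] = y_k * s_hjk / s_k for every arc starting in stratum h
    a = {(j, k): y[k] * p / s[k] for (hh, j, k), p in arcs.items() if hh == h}
    big_n = len({j for j, _ in a})
    # sum over k of y_k^2 q_hk / s_k^2, accumulated arc by arc
    square_term = sum(
        y[k] ** 2 * p / s[k] ** 2 for (hh, _, k), p in arcs.items() if hh == h
    )
    pairs = list(combinations(sorted(a), 2))
    admissible = sum(a[u] * a[v] for u, v in pairs if u[0] != v[0])
    inadmissible = sum(a[u] * a[v] for u, v in pairs if u[0] == v[0])
    total = sum(a.values())  # equals the apportioned stratum total Y_h
    if simplified:
        cross = total ** 2 - sum(v ** 2 for v in a.values()) - 2 * inadmissible
    else:
        cross = 2 * admissible
    return F(big_n, n_h) * (square_term + F(n_h - 1, big_n - 1) * cross) - total ** 2


for h, n_h in ((1, 2), (2, 3)):
    pairwise = stratum_variance(ARCS, Y, h, n_h)
    computational = stratum_variance(ARCS, Y, h, n_h, simplified=True)
    print(h, pairwise == computational)
```

The computational form avoids listing admissible pairs, which are usually far more numerous than the inadmissible pairs that share a frame unit.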


3.4 Population Means 
3.4.1 Estimator for a Population Mean 


The estimator developed here for a population mean for 
stratified random sampling extends the estimator presented 
by Hansen et al. (1953a, pp. 62-64) to the case of a stratified 
random sample from a many-to-many frame. 

The estimator for a population mean when using 
stratified random sampling and a many-to-many frame is: 


(3.10) 


As in the SRSWOR case, the estimator for a population 
mean is biased because it is a ratio estimator. 


4. CONCLUSIONS 


In this paper we have developed estimators for 
population totals, counts and means that are appropriate 
when the sampling frame has a many-to-many structure. 
We have focused on simple random sampling and stratified 
random sampling designs. 

We used the method of weighting described in this paper 
in a study of commercial buildings for which a stratified 
random sample was employed. In this study, for which the 
sampling frame consisted of street addresses, interviewers 
recorded any additional street addresses that pertained to 
the selected building. It was then determined whether or 
not these additional street addresses were listed in the 
sampling frame, and whether or not they were connected to 
other population elements (commercial buildings). In more 
complex scenarios, the interviewers sometimes resorted to 
sketching schematics of the buildings and labelling all the 
pertinent addresses. This allowed us to determine the 
structure of all MC subgraphs in our sample and to develop 
the appropriate weights s,. 

In addition, we developed formulas for the variance of 
some of the estimators presented in this paper. It should be 
noted that these variance formulas are population para- 
meters and do not translate readily into corresponding 
sample estimates. In fact, the authors are unaware of any 
optimal method for estimating the variances discussed in 
this paper. However, there are many computer intensive 
methods (balanced repeated replication, bootstrapping, etc.) 
for estimating variances in complex sample surveys (Wolter 
1985). It should be emphasized that when using our estima- 
tors, each of these variance estimation schemes aims at a 
common target: the variance formulas we have developed. 

Nevertheless, the usefulness of these variance formulas 
is in their application to the task of exploring the effects of 
frame imperfections, along with population characteristics, 
on the precision of estimation. Such an exploration, 
another future area of research, should result in recommen- 
dations and guidelines for the survey researcher on how to 
manage a frame with a many-to-many structure. That is, 
based upon frame and population characteristics, the survey 
researcher would be able to make strategic decisions 
concerning the options available: canvassing a population 
to remove correspondence imperfections, or using the 
estimators described herein. 

Another area of future research is a comparison of the 
precision of our estimators to that of other estimators, such as 
the Horvitz-Thompson estimator. As noted in the introduction, 
the Horvitz-Thompson estimator can be applied to sampling 
involving a many-to-many frame structure. An advantage of 
the Horvitz-Thompson estimator is that with properly 
identified first and second order inclusion probabilities, one 
can obtain both an estimate of a population characteristic and 
an unbiased estimate of its variance. In addition, the first order 
inclusion probabilities can be derived in a manner similar to 
Musser (1993) based only upon information from the MC 
subgraphs. However, these probabilities are very difficult to 
compute in a complex many-to-many frame structure such as 
ours. It is, however, relatively easy to calculate the necessary 
weights for our estimators. 


Optimal Recursive Estimation for Repeated Surveys 


IBRAHIM S. YANSANEH and WAYNE A. FULLER¹ 


ABSTRACT 


Least squares estimation for repeated surveys is addressed. Several estimators of current level, change in level and average 
level for multiple time periods are developed. The Recursive Regression Estimator, a recursive computational form of the 
best linear unbiased estimator based on all periods of the survey, is presented. It is shown that the recursive regression 
procedure converges; and that the dimension of the estimation problem is bounded as the number of periods increases 
indefinitely. The recursive procedure offers a solution to the problem of computational complexity associated with 
minimum variance unbiased estimation in repeated surveys. Data from the U.S. Current Population Survey are used to 
compare alternative estimators under two types of rotation designs: the intermittent rotation design used in the U.S. Current 
Population Survey, and two continuous rotation designs. 


KEY WORDS: Recursive regression estimation; Composite estimation; Rotation designs; Rotation groups. 


1. INTRODUCTION 


We consider least squares estimation for surveys 
conducted on repeated occasions with partial overlap of 
sampling units. See Duncan and Kalton (1987) for a 
general discussion of different types of surveys and the 
objectives of such surveys. In this paper, we shall be 
concerned with rotating panel surveys, in which repeated 
determinations are made on some sampling units but not 
every unit appears in the sample at every time point. 

Theoretical foundations for the design and estimation for 
repeated surveys based on generalized least squares proce- 
dures were laid down by Patterson (1950), following initial 
work by Cochran (1942) and Jessen (1942). Least squares 
procedures have been considered further by several other 
authors. See, for instance, Fuller (1990), and the references 
cited therein. Least squares estimation for a fairly general 
class of repeated surveys was considered by Yansaneh 
(1992). Composite estimation is a procedure of estimation 
for repeated surveys which makes use of the observations 
from the current and preceding periods, and the estimator of 
level from the preceding period. Breau and Ernst (1983) 
compared various alternative estimators to a composite 
estimator for the U.S. Current Population Survey (CPS). 
Kumar and Lee (1983) did similar work using data from 
the Canadian Labor Force Survey (LFS). Wolter (1979) 
provided a general composite estimation strategy for 
two-level rotation schemes such as the one used in the U.S. 
Census Bureau’s Retail Trade Survey. Singh (1996) has 
proposed an alternative class of composite estimators. 
These authors assumed the unknown quantities on each 
occasion to be fixed parameters. Other authors, such as 
Scott, Smith, and Jones (1977), Jones (1980), Binder and 
Dick (1989), Bell and Hillmer (1990), and Pfeffermann 
(1991) considered estimation for repeated surveys under the 


assumption that the underlying true values constitute a 
realization of a time series. 

In this paper, we discuss estimation procedures for 
repeated surveys, under the assumption that the unknown 
true values are fixed parameters. The estimators are 
compared to the method of composite estimation currently 
used in the CPS. The paper is organized as follows: In 
section 2, we state some basic assumptions regarding the 
general class of repeated surveys considered in this paper. 
A description of the CPS method of composite estimation 
is given in section 3. The method of best linear unbiased 
estimation is discussed in section 4. In section 5, we 
present a recursive regression estimation procedure 
designed to reduce the computational complexity associated 
with best linear unbiased estimation. Section 6 is devoted 
to an application to data from the CPS. Alternative 
estimators and rotation designs are compared. 


2. BASIC ASSUMPTIONS 


In this section, we describe surveys of the type we will 
study. A rotation group is a set of individuals selected for 
the sample and observed for a fixed number of periods and 
in a fixed pattern over time. Assume that in each period of 
the survey, s rotation groups are included in the sample, 
where s>1 is fixed. Assume that the basic data from the 
survey can be organized in a set of elementary estimators 
(such as simple sample means and estimated totals) of the 
parameters of interest (such as population means and 
totals), where a set of elementary estimators is associated 
with each rotation group. For computational convenience, 
the data for p periods can be arranged in a p × s data 
matrix, denoted by H, in such a way that the observations 
on a rotation group appear in only one column. The total 
number of elementary estimators is n = p × s. We call the 
columns of H streams. Several rotation groups can appear 
in a stream. Assume that: 


(1) A given rotation group in a stream is observed over a 
period of total length m+ 1, and the observation 
pattern for rotation groups is fixed and is the same for 
all groups. 


(2) The design is balanced on time-in-sample. That is, of 
the s rotation groups included in the sample at a given 
time, one group is being observed for the first time, one 
is being observed for the second time, ..., one is being 
observed for the last time, where the last time is 
separated by m periods from the first observation. 


These assumptions are satisfied by surveys such as the CPS 
and the Canadian Labor Force Survey. See Yansaneh 
(1997) for an illustration of the 4-8-4 rotation scheme used 
in the CPS. 
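
A small sketch may help to see how a 4-8-4 pattern satisfies assumptions (1) and (2). It assumes that one new rotation group enters every month and that each group is interviewed for four months, left out for eight, and interviewed for four more; the function names and the month numbers are illustrative.

```python
def time_in_sample(entry_month, month):
    """Return k in 1..8 if a 4-8-4 group that entered at entry_month is
    interviewed in `month`, otherwise None."""
    offset = month - entry_month
    if 0 <= offset <= 3:
        return offset + 1      # first spell: times 1 to 4
    if 12 <= offset <= 15:
        return offset - 7      # second spell: times 5 to 8
    return None


def groups_in_sample(month):
    """Map entry month -> time-in-sample for the groups interviewed in `month`."""
    result = {}
    for entry in range(month - 15, month + 1):
        k = time_in_sample(entry, month)
        if k is not None:
            result[entry] = k
    return result


current = groups_in_sample(100)
previous = groups_in_sample(99)
continuing = sorted(k for e, k in current.items() if e in previous)
incoming = sorted(k for e, k in current.items() if e not in previous)
print(sorted(current.values()))
print(continuing, incoming)
```

Each month has eight groups in sample, one for each time-in-sample, and the first and last observations on a group are separated by m = 15 periods. The groups in their first and fifth times in sample are the ones that were not interviewed in the preceding month.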


3. THE CPS COMPOSITE ESTIMATOR 


In general, composite estimators combine recent 
estimator(s) and data from the current and preceding period(s) 
to form an estimator for the current period. With the CPS, 
six of the eight rotation groups observed at time $t$ were 
observed at time $t - 1$. We shall refer to these six rotation 
groups as continuing rotation groups, and the remaining 
two as incoming rotation groups. 

The composite estimator currently in use is determined 
by two parameters. The estimator is 

$$\hat{\theta}_{c,t} = (1 - \pi_1) \bar{y}_t + \pi_1 (\hat{\theta}_{c,t-1} + \hat{\delta}_{t,t-1}) + \pi_2 \hat{\beta}_t \qquad (1)$$ 

where, for the estimator currently used, $\pi_1 = 0.4$ and 
$\pi_2 = 0.2$, $y_{t,k}$ is the elementary estimate of level obtained 
from the rotation group which is in its $k$-th time in sample 
at time $t$, $\bar{y}_t = 8^{-1} \sum_{k=1}^{8} y_{t,k}$ is the basic estimator, defined as 
the mean of the elementary estimates based on the eight 
rotation groups observed at time $t$, $\hat{\theta}_{c,t-1}$ is the composite 
estimator for time $t - 1$, $\hat{\delta}_{t,t-1}$ is an estimate of change in 
level, based on the six continuing rotation groups at time $t$, 
and $\hat{\beta}_t$ is the difference between the averages of the two 
incoming rotation groups and the six continuing rotation 
groups. Thus, 

$$\hat{\delta}_{t,t-1} = 6^{-1} \sum_{k \in S} (y_{t,k} - y_{t-1,k-1})$$ 

and 

$$\hat{\beta}_t = 8^{-1} \left( \sum_{k \in T} y_{t,k} - 3^{-1} \sum_{k \in S} y_{t,k} \right),$$ 

where T = {1,5} and S = {2, 3, 4, 6, 7, 8}. The composite 
estimator used until 1985 contained only the first two terms 
on the right of (1). The third term was introduced for the 


purpose of reducing the time-in-sample effects appearing in 
the original estimator. The incoming rotation groups 
produce larger estimates of unemployed than do the 
continuing rotation groups. Therefore, the direct difference 
$\hat{\delta}_{t,t-1}$ is influenced by the fact that the rotation group 
in its first time-in-sample has a larger expected value than 
that of the second time-in-sample. The time-in-sample 
effects do not cancel out in the difference estimate. The 
third term is an adjustment term which has the effect of 
reducing both the variance of the original composite 
estimator and the bias associated with time-in-sample 
effects. See Bailar (1975) or Breau and Ernst (1983) for a 
discussion of the bias of the pre-1985 composite estimator 
due to time-in-sample effects. We shall refer to the three- 
term composite estimator currently used in the CPS as the 
CPS Composite Estimator. This estimator has a variance 
close to that of the best linear unbiased estimator for 
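monthly estimates of unemployment level. Let $y_{i,t}$, 
$i = 1, 2, \ldots, s$, be the elementary estimator of the parameter 
of interest obtained from the rotation group which is in 
stream $i$ at time $t$. The CPS composite estimator can be 
written as 

$$\hat{\theta}_{c,t} = \sum_{i=1}^{8} \alpha_{k(i,t)}\, y_{i,t} + \sum_{i=1}^{8} \omega_{k(i,t)}\, y_{i,t-1} + \pi_1 \hat{\theta}_{c,t-1} \qquad (2)$$ 

where $k(i, t) = k$ defines the time-in-sample of observation 
$(i, t)$ as a function of the stream ($i$) and time ($t$). If 
$\lambda_1 = 1/8$, $\lambda_2 = 1/6$, and $\lambda_3 = 1/3$, then $\omega_k = -\pi_1 \lambda_2$ for $k \in S$ 
and $\omega_k = 0$ for $k \in T$; and 

$$\alpha_k = \begin{cases} (1 - \pi_1) \lambda_1 + \pi_1 \lambda_2 - \pi_2 \lambda_1 \lambda_3 & \text{for } k \in S \\ (1 - \pi_1) \lambda_1 + \pi_2 \lambda_1 & \text{for } k \in T. \end{cases}$$ 

Let $p_1 = (\alpha_{k(1,t)}, \alpha_{k(2,t)}, \ldots, \alpha_{k(8,t)})'$, $p_2 = (\omega_{k(1,t)}, \omega_{k(2,t)}, \ldots, \omega_{k(8,t)})'$, 
and $y_t = (y_{1,t}, y_{2,t}, \ldots, y_{8,t})'$. Then, 

$$\hat{\theta}_{c,t} = p_1' y_t + p_2' y_{t-1} + \pi_1 \hat{\theta}_{c,t-1}. \qquad (3)$$ 

Substituting in (3) recursively, we have, for an estimator 
initiated at time zero, 

$$\hat{\theta}_{c,t} = p_1' y_t + \sum_{j=1}^{t} \pi_1^{j-1} (p_2 + \pi_1 p_1)' y_{t-j}. \qquad (4)$$ 

Equation (4) is an expression of $\hat{\theta}_{c,t}$ as a linear function of 
current and past observations, where the weight of an 
observation declines as its distance from the current period 
increases. 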
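
One update of the three-term composite estimator can be sketched as follows. This is an illustration under assumptions rather than the production CPS computation: the two parameters are taken to be 0.4 and 0.2 (called `pi_1` and `pi_2` here), the change term is the average change over the six continuing groups, and the adjustment term is one eighth of the incoming-group sum minus one third of the continuing-group sum. The function name and the input values are hypothetical.

```python
import math

T = (1, 5)               # times in sample of the two incoming rotation groups
S = (2, 3, 4, 6, 7, 8)   # times in sample of the six continuing rotation groups


def composite_estimate(y_now, y_prev, theta_prev, pi_1=0.4, pi_2=0.2):
    """One update of the three-term composite estimator.

    y_now and y_prev map time-in-sample k (1..8) to the elementary estimate
    at the current and the preceding period; theta_prev is the composite
    estimate for the preceding period.
    """
    y_bar = sum(y_now[k] for k in range(1, 9)) / 8
    # A group in its k-th time in sample now was in its (k-1)-th last period.
    delta = sum(y_now[k] - y_prev[k - 1] for k in S) / 6
    beta = (sum(y_now[k] for k in T) - sum(y_now[k] for k in S) / 3) / 8
    return (1 - pi_1) * y_bar + pi_1 * (theta_prev + delta) + pi_2 * beta


flat = {k: 5.0 for k in range(1, 9)}
shifted = {k: 6.0 for k in range(1, 9)}
print(math.isclose(composite_estimate(flat, flat, 5.0), 5.0))
print(math.isclose(composite_estimate(shifted, flat, 5.0), 6.0))
```

When all elementary estimates agree, the adjustment term vanishes and the estimator reproduces the common level; a uniform shift in level is passed through in full because both the basic estimator and the change term pick it up.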
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4. BEST LINEAR UNBIASED ESTIMATION 


Suppose $\Theta_p = (\theta_1, \theta_2, \ldots, \theta_p)'$ is the $p \times 1$ vector of 
parameters of interest, where $\theta_t$, $t = 1, 2, \ldots, p$, is the level 
of the parameter of interest at time $t$. Thus at time $p$, $\theta_p$ is 
the current level of the parameter of interest. For example, 
in the context of the CPS, $\theta_t$ might represent the population 
mean or proportion of unemployed at time $t$. Our objective 
is to construct efficient estimators of the current level of the 
parameters. The change in level and average level over 
multiple periods of time are also of interest. 

The best linear unbiased estimator (BLUE) of the current 
level is defined to be the minimum-variance unbiased linear 
combination of the elementary estimators from the rotation 
groups available for estimation. It is possible in the process 
of computing the BLUE for the current level, to also 
compute the BLUEs for all periods using data available at 
the current time. 

Suppose that a repeated survey has been in operation for 
$p$ periods and that $s$ streams of data collected over $p$ periods 
are available for estimation. Let $y_i = (y_{i,1}, y_{i,2}, \ldots, y_{i,p})'$ be 
the vector of $p$ observations in the $i$-th stream at time $t$. Let $Y_n$ 
be the data vector formed by the streams or columns of the $p \times s$ 
data matrix H, arranged chronologically. Thus, $Y_n = 
(y_1', y_2', \ldots, y_s')'$ is an $n \times 1$ vector of observations, where 
$n = s \times p$. Let $X = J_s \otimes I_p$ be the $n \times p$ design matrix 
which relates the estimates in $Y_n$ to their expected values 
in $\Theta_p$, where $J_s$ is the $s \times 1$ vector of ones, $I_p$ is the 
identity matrix of order $p$, and $\otimes$ denotes the Kronecker 
product. The linear model for $Y_n$ is 

$$Y_n = X \Theta_p + U_n \qquad (5)$$ 

where $U_n$ is the vector of error terms satisfying the 
assumptions $E(U_n) = 0$ and $E(U_n U_n') = V_n$, where $V_n$ is 
assumed to be a known, symmetric, and positive definite 
matrix. By the Gauss-Markov Theorem, the BLUE of $\Theta_p$ 
is 

$$\hat{\Theta}_p = (X' V_n^{-1} X)^{-1} X' V_n^{-1} Y_n.$$ 

The covariance matrix of $\hat{\Theta}_p$ is $\Sigma_p = (X' V_n^{-1} X)^{-1}$. 
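
The generalized least squares computation can be illustrated on a very small case. The sketch below assumes two streams, two periods and a diagonal covariance matrix with hypothetical entries; the data are stacked stream by stream, so the design matrix is two stacked identity matrices. All function names are illustrative, and exact fractions replace floating point for clarity.

```python
from fractions import Fraction as F


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse of a square matrix, in exact arithmetic."""
    n = len(a)
    aug = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def design_matrix(s, p):
    """Kronecker product of the s x 1 vector of ones with the p x p identity."""
    return [[F(int(t == u)) for u in range(p)] for _ in range(s) for t in range(p)]


def blue(y, v, s, p):
    """Return the GLS estimate of the p levels and its covariance matrix."""
    x = design_matrix(s, p)
    xt_vinv = matmul(transpose(x), inverse(v))
    cov = inverse(matmul(xt_vinv, x))
    theta = matmul(cov, matmul(xt_vinv, [[F(val)] for val in y]))
    return [row[0] for row in theta], cov


# Hypothetical data: stream 1 (variance 1) then stream 2 (variance 3), two periods each.
y = [10, 12, 14, 8]
v = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 3, 0], [0, 0, 0, 3]]
theta, cov = blue(y, v, s=2, p=2)
print([str(t) for t in theta], [[str(c) for c in row] for row in cov])
```

With a diagonal covariance matrix the estimate for each period reduces to an inverse-variance weighted mean of the stream observations for that period; correlation over time within a stream is what makes the general problem grow with the number of periods.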


5. RECURSIVE REGRESSION ESTIMATION 


Recursive estimation techniques have been found useful 
in situations where data do not all become available at the 
same time but rather accumulate over time, and the 
computation of optimal estimates based on all available data 
is impractical. See, for example, Odell and Lewis (1971), 
Sallas and Harville (1981) and references cited therein, for 
recursive algorithms for best linear unbiased estimation. 
Tiller (1989) presented a Kalman-filter approach to 
estimation of labor force characteristics using survey data. 

As described in Section 4, the direct computation of the 
BLUE becomes progressively more complicated as the 
number of periods increases. We develop a recursive 
regression estimation procedure for repeated surveys that 
uses a judiciously chosen set of initial estimates, new 
observations of the current level, and the previous 
observations on the currently observed rotation groups to 
produce the BLUE of current level. 


5.1 Transformed Elementary Estimates and 
a Proposed Estimator 


Suppose a survey has been in operation for at least m 
periods and assume: 


(3) The rotation groups are independent. 
(4) The covariance structure of the observations is known. 


(5) The covariance structure of the observations in a 
stream is constant over time, and it is the same for all 
streams. 


These assumptions are used in the construction of a linear 
estimator. Assumption (3) will be relaxed for the 
computation of the variance of the estimator. Under 
assumptions (1) and (3), observations that are more than m 
periods apart are independent. At the current time, denoted 
by $c$, where $c > m$, a set of $s$ elementary estimates of the 
parameter $\theta_c$ are observed. To construct the generalized 
least squares estimator, the $s$ current observations are 
transformed so that they are uncorrelated with previous 
observations. After transformation, the expected values of 
the transformed observations are functions of $\theta_c$ and the 
parameters for the $m$ preceding periods. Assume that the 
BLUE of the vector of parameters for the previous $m$ 
periods, and the $m \times m$ covariance matrix of these 
estimators, are available. Thus, at time $c$, we have: (i) $m$ 
initial estimates $\hat{\Theta}_{c-1(m)} = (\hat{\theta}_{c-m}, \hat{\theta}_{c-m+1}, \ldots, \hat{\theta}_{c-1})'$; (ii) the covariance 
matrix $\Sigma_{c-1(m)}$ of $\hat{\Theta}_{c-1(m)}$; and (iii) $s$ independent 
observations on the $s$ streams at the current time. Let the 
transformed observations, denoted by $z_{i,c}$, $i = 1, 2, \ldots, s$, be 

$$z_{i,c} = y_{i,c} - \sum_{j=1}^{m} b_{k(i,c),j}\, y_{i,c-j} \qquad (6)$$ 

where $b_{k(i,c),j}$ are the coefficients such that $z_{i,c}$ is 
uncorrelated with $y_{i,c-j}$ for all $j > 0$. By assumptions (4) 
and (5), the coefficients $b_{k(i,c),j}$ are fixed over time. By 
assumption (3), $z_{i,c}$ is uncorrelated with all earlier observations. 
The expected value of $z_{i,c}$ is 

$$\theta_c - \sum_{j=1}^{m} b_{k(i,c),j}\, \theta_{c-j}.$$ 
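
The coefficients of the transformation are regression coefficients of the current observation on the earlier observations in the same stream. A minimal sketch for two earlier observations follows; it assumes, purely for illustration, that the covariance between two observations in a stream depends only on the number of periods separating them, and the numerical covariances and function names are hypothetical.

```python
from fractions import Fraction as F


def transform_coefficients(gamma):
    """Solve for (b_1, b_2) given covariances gamma = (g0, g1, g2) at lags 0, 1, 2.

    The coefficients satisfy g0*b1 + g1*b2 = g1 and g1*b1 + g0*b2 = g2,
    solved here by Cramer's rule.
    """
    g0, g1, g2 = gamma
    det = g0 * g0 - g1 * g1
    b1 = (g1 * g0 - g1 * g2) / det
    b2 = (g0 * g2 - g1 * g1) / det
    return b1, b2


def covariance_with_past(gamma, b, lag):
    """Covariance between z = y_c - b1*y_{c-1} - b2*y_{c-2} and y_{c-lag}, lag 1 or 2."""
    g0, g1, g2 = gamma
    b1, b2 = b
    if lag == 1:
        return g1 - b1 * g0 - b2 * g1
    return g2 - b1 * g1 - b2 * g0


def transformed_variance(gamma, b):
    """Variance of z, which simplifies because z is uncorrelated with the past."""
    g0, g1, g2 = gamma
    b1, b2 = b
    return g0 - b1 * g1 - b2 * g2


gamma = (F(1), F(3, 5), F(1, 5))   # hypothetical covariances at lags 0, 1, 2
b = transform_coefficients(gamma)
print(b, covariance_with_past(gamma, b, 1), covariance_with_past(gamma, b, 2))
print(transformed_variance(gamma, b))
```

Because the covariance structure is assumed constant over time, the coefficients need to be computed only once for each time-in-sample pattern.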


5.2 The Recursive Regression Estimator 


Let $\hat{\theta}_h(t)$, $h \le t$, denote the least squares estimator of the 
(scalar) parameter $\theta_h$ constructed using data through time 
$t$; and let $\hat{\Theta}_{t(m)} = (\hat{\theta}_{t-m+1}(t), \ldots, \hat{\theta}_t(t))'$ denote the least 
squares estimator of the vector of $m$ parameters 
$\theta_{t-m+1}, \ldots, \theta_t$, at time $t$ constructed using data through time 
$t$. Our objective is to construct the minimum variance 
estimator for $\theta_c$, the current level of the parameter of 
interest using all data available at time $c$. A linear model 
for data available at the current time is 

$$Z_c = W \Theta_{c(m+1)} + U_c \qquad (7)$$ 

where 

$$W = \begin{pmatrix} I_m & 0 \\ X_b & J_s \end{pmatrix},$$ 

$Z_c = (\hat{\Theta}_{c-1(m)}', z_{1,c}, \ldots, z_{s,c})'$, $\Theta_{c(m+1)} = (\theta_{c-m}, \ldots, \theta_c)'$, and $X_b$ is an $s \times m$ 
matrix whose entries are constant over time, and are 
functions of the coefficients $b_{k,j}$ of (6). If $\mathrm{Var}\{z_{i,c}\} = \sigma_i^2$, 
$i = 1, 2, \ldots, s$, and $\Omega_{ss}$ is the diagonal matrix with $\sigma_i^2$ as 
the diagonal entries, then the covariance matrix of $Z_c$ is 
$V_c = \mathrm{blockdiag}\{\Sigma_{c-1(m)}, \Omega_{ss}\}$. It is assumed that $\sigma_i^2$, 
$i = 1, 2, \ldots, s$, are positive. 

The recursive regression estimator (RRE) of $\Theta_{c(m+1)}$ is 
defined to be the least squares estimator of $\Theta_{c(m+1)}$ based on 
model (7). Thus the RRE of $\Theta_{c(m+1)}$ is 

$$\hat{\Theta}_{c(m+1)} = (W' V_c^{-1} W)^{-1} W' V_c^{-1} Z_c \qquad (8)$$ 

and the covariance matrix of $\hat{\Theta}_{c(m+1)}$ is $\Sigma_{c(m+1)} = 
(W' V_c^{-1} W)^{-1}$. 

The utility of the estimator (8) is its computational 
simplicity. At any fixed time $t$ in a repeated survey, all the 
information relevant to the problem of estimating 
$\theta_t, \theta_{t-1}, \ldots, \theta_{t-m}$ can be obtained from a set of $m$ recursive 
least squares estimates and the current observations. 

We now describe more fully the recursive regression 
procedure. At time $t$, we have $\hat{\Theta}_{t(m+1)}$, the RRE of $\Theta_{t(m+1)}$, 
and its $(m+1) \times (m+1)$ covariance matrix $\Sigma_{t(m+1)}$. 
Partition $\Sigma_{t(m+1)}$ as 

$$\Sigma_{t(m+1)} = \begin{pmatrix} v_{11,t} & V_{12,t} \\ V_{12,t}' & \Sigma_{t(m)} \end{pmatrix}$$ 

where $v_{11,t}$ is the variance of $\hat{\theta}_{t-m}(t)$, $\Sigma_{t(m)}$ is the 
covariance matrix of $(\hat{\theta}_{t-m+1}(t), \ldots, \hat{\theta}_t(t))'$, and $V_{12,t}$ is the 
covariance between these two quantities. Observe that if 
$\theta_{t-m}$ is retained in the parameter vector and $\hat{\theta}_{t-m}(t)$ is retained 
in the data vector, the estimator of $\theta_{t+1}$ is unchanged (the 
estimator of $\theta_{t-m}$ would, in general, be changed). This is 
because the estimator of the original parameter vector of a 
least squares problem is not changed if an observation 
whose expectation is equal to a single new parameter is 
added to the problem. Thus, to update the RRE for the next 
period, we drop the initial estimate for the earliest period, 
$\hat{\theta}_{t-m}(t)$, from the data vector, and drop the corresponding 
parameter $\theta_{t-m}$ from the parameter vector. The parameter $\theta_{t+1}$ 
is then added to the parameter vector. In this way, the 
dimension of the basic model matrix $W$ of the estimation 
problem is kept constant over time. Thus in the class of 
repeated surveys considered in this paper, there is an upper 
bound on the computational effort required for the BLUE 
of the vector of parameters of interest. 

The model at time $t + 1$ may be written as model (7), 
with $Z_{t+1} = (\hat{\Theta}_{t(m)}', z_{1,t+1}, \ldots, z_{s,t+1})'$, 
$\Theta_{t+1(m+1)} = (\theta_{t-m+1}, \ldots, \theta_{t+1})'$, and the covariance matrix 
of $Z_{t+1}$ is $V_{t+1} = \mathrm{blockdiag}\{\Sigma_{t(m)}, \Omega_{ss}\}$. The BLUE of 
$\Theta_{t+1(m+1)}$ and its covariance matrix are then obtained from 
the usual least squares formulas. The least squares 
estimators of the last $m$ elements of $\Theta_{t+1(m+1)}$ are then used 
as the initial estimates in the model for the next iteration. 
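
One step of the recursion can be sketched in a few lines. The sketch assumes that the model matrix stacks an identity block for the carried-over estimates on top of a block holding the negated transformation coefficients followed by a column of ones; the function names, the choice m = 1 and s = 2, and all numbers are hypothetical.

```python
from fractions import Fraction as F


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse of a square matrix, in exact arithmetic."""
    n = len(a)
    aug = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def rre_step(theta_prev, sigma_prev, z, x_b, omega_diag):
    """One recursive regression step.

    theta_prev, sigma_prev: the m carried-over estimates and their covariance.
    z: the s transformed current observations; omega_diag: their variances.
    x_b: s x m matrix; row i holds the expectation coefficients of z_i on the
    m earlier parameters (the negated transformation coefficients).
    """
    m, s = len(theta_prev), len(z)
    w = [[F(int(i == j)) for j in range(m)] + [F(0)] for i in range(m)]
    w += [[F(c) for c in x_b[i]] + [F(1)] for i in range(s)]
    # Block-diagonal covariance of the stacked data vector.
    v = [[F(c) for c in row] + [F(0)] * s for row in sigma_prev]
    v += [[F(0)] * m + [F(omega_diag[i]) if i == j else F(0) for j in range(s)]
          for i in range(s)]
    data = [[F(t)] for t in theta_prev] + [[F(val)] for val in z]
    wt_vinv = matmul(transpose(w), inverse(v))
    cov = inverse(matmul(wt_vinv, w))
    theta = [row[0] for row in matmul(cov, matmul(wt_vinv, data))]
    # Drop the earliest period; keep the last m estimates for the next step.
    return theta, cov, theta[1:], [row[1:] for row in cov[1:]]


theta, cov, next_theta, next_sigma = rre_step(
    theta_prev=[4], sigma_prev=[[1]], z=[10, 14],
    x_b=[[F(-1, 2)], [F(-1, 2)]], omega_diag=[1, 3],
)
print([str(t) for t in theta], [[str(c) for c in row] for row in cov])
```

The size of every matrix in the step depends only on m and s, which is the sense in which the computational effort stays bounded as periods accumulate.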

The following theorem states that the covariance matrix 
of the vector of recursive least squares estimators converges 
to a positive definite matrix as the number of periods in the 
survey increases indefinitely. A proof is given in the 
appendix. 


Theorem: At any time $t$, let the vector of recursive least 
squares estimators $\hat{\Theta}_{t(m)} = (\hat{\theta}_{t-m+1}(t), \ldots, \hat{\theta}_t(t))'$ be 
the BLUE of the vector of parameters $\Theta_{t(m)} = 
(\theta_{t-m+1}, \ldots, \theta_t)'$ based on data through time $t$. Let $\Sigma_{t(m)}$ 
be the covariance matrix of $\hat{\Theta}_{t(m)}$. Let the assumptions (1) 
through (5) hold. Also assume that the elements of $V_n$ are 
bounded for all $n$, where $V_n$ is the covariance matrix of the 
$n$ observations. Then, the covariance matrix $\Sigma_{t(m)}$ 
converges as $t \to \infty$; and the limit is an $m \times m$ positive 
definite matrix. 


6. APPLICATION TO THE U.S. CURRENT 
POPULATION SURVEY 


6.1 The CPS Design 


The CPS is a monthly household survey conducted by 
the United States Census Bureau in cooperation with the 
Bureau of Labor Statistics for the purpose of providing 
national estimates of labor force characteristics such as the 
number employed, unemployed, and in the civilian labor 
force; and other characteristics of the non-institutionalized 
civilian population. The sample design of the CPS contains 
a rotation scheme that includes the replacement of a fraction 
of the households in the sample each month. For any given 
month, the sample consists of eight time-in-sample panels 
or rotation groups, of which one is being interviewed for the 
first time, one is being interviewed for the second time,..., 
and one is being interviewed for the eighth time. In other 
words, the interview scheme is balanced on time-in-sample. 
Households in a rotation group are interviewed for four 
consecutive months, dropped for the next eight succeeding 
months, and then interviewed for another four consecutive 
months. They are then dropped from the sample entirely. 
This system of interviewing is called the 4-8-4 rotation 
scheme, and is a special case of schemes described by Rao 
and Graham (1964). 


6.2 Estimation and Variance Estimation Procedures 


We use estimates of the covariance structure of data 
from the CPS to compare alternative estimators and rotation 
designs. See Adam and Fuller (1992) and Fuller, Adam and 
Yansaneh (1993) for a detailed description of the con- 
struction of the model, the estimation of its parameters, and 
the estimation of the covariance structure of observations 
within a given rotation group for various characteristics of 
interest. Because the rotation groups come from the same 
set of primary sampling units, they are not independent and 
a component is included in the covariances to reflect the 
fact that the primary sampling units do not change. The 
RRE is computed with the eight current simple estimators 
and the 15 estimators for the 15 preceding periods. In 
computing the RRE, the covariances are used to create eight 
linear combinations of the current and the preceding fifteen 
observations that are uncorrelated with the preceding fifteen 
observations. Because of the primary sampling unit effect, 
these linear combinations are correlated with observations 
more than 15 periods in the past and in the same stream. 
Hence, they are correlated with the preceding estimators. 
The correlations with earlier estimators, $\hat{\theta}_{c-15}, \ldots, \hat{\theta}_{c-1}$, 
are included in the covariance matrix when the estimator of $\theta_c$ 
is constructed. However, because only the most recent 15 
observations are used, the resultant estimator of $\theta_c$ is not the 
BLUE for current level. The calculated covariance matrix 
of $(\hat{\theta}_{c-15}, \ldots, \hat{\theta}_c)'$ is correct and, because the primary 
sampling unit effect is modest, it is felt that the estimator 
has efficiency close to that of the BLUE. 

We shall restrict attention to the estimation of various 
parameters for two characteristics of interest: Employed 
and Unemployed. For each characteristic, the parameters 
of interest are the current level and period-to-period change 
for up to 12 periods. The estimators considered for 
comparison are the CPS composite estimator; the RRE; and 
the BLUEs using 2, 3, 12, 16, and 24 periods, where the 
BLUE for p periods at time t is the least squares estimator 
constructed using data from time t - p + 1 through time t. 
Results are reported for BLUEs based on 12 and 16 periods. 
Following the practice of the U.S. Bureau of Labor 
Statistics for CPS estimators, the estimators are not 
modified as new data become available. Thus the estimator 
of change in level of a characteristic of interest between 
times t - 1 and t is not the best possible estimator given all 
available data. It is the difference between the best 
estimator at time t based on data through time t and the best 
estimator at time t - 1 based on data through time t - 1. 

We do not consider seasonal adjustment in this 
discussion. However, the estimation procedures presented 
can be extended to include seasonal adjustments. To 
compute the variance of a given estimator at a given time, 
the estimator is first expressed as a linear combination of 
all the observations available at that time. The variance of 
the estimator is then computed as a function of the 
coefficients of the linear combination and the entries of the 
covariance matrix. 
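
As a minimal sketch of this computation: if an estimator is written as c'y for a coefficient vector c and the observations y have covariance matrix V, its variance is c'Vc. The equal-variance, equal-correlation covariance matrix below is a hypothetical stand-in, not the covariance structure estimated from the CPS.

```python
import numpy as np


def linear_estimator_variance(coef, cov):
    """Variance of the linear combination coef' y when Cov(y) = cov."""
    coef = np.asarray(coef, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return float(coef @ cov @ coef)


# Hypothetical example: eight rotation-group estimates with common
# variance sigma2 and a common correlation rho between groups.
sigma2, rho = 1.0, 0.1
cov = sigma2 * ((1.0 - rho) * np.eye(8) + rho * np.ones((8, 8)))
coef = np.full(8, 1.0 / 8.0)                 # simple mean of the eight groups

var_mean = linear_estimator_variance(coef, cov)
var_mean_independent = linear_estimator_variance(coef, sigma2 * np.eye(8))
```

With independent groups the result is sigma2/8; a positive common correlation raises it to sigma2 * ((1 - rho)/8 + rho).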


6.3 Numerical Results and Discussion 


6.3.1 Comparison of Alternative Estimators 


The variances of the alternative estimators relative to the 
variance of the basic estimator of current level, for each of 
the characteristics of interest, are presented in Table 1. 
Recall that the basic estimator of the current level, denoted 
by $\bar{y}_t$, is the simple mean of the eight elementary estimators 
obtained from the eight rotation groups observed at time t. 
That is, $\bar{y}_t = 8^{-1} \sum_{k=1}^{8} y_{k,t}$ and $\mathrm{Var}(\bar{y}_t) = \sigma^2/8$, where 
$\sigma^2 = \mathrm{Var}(y_{k,t})$ for all t and k. The basic estimator of change 
between two periods is the difference between the simple 
means for the two periods. 

The BLUE procedure based on 3 periods or more 
produces more efficient estimators of current level than the 
CPS composite estimator. In general, the best linear 
unbiased estimation procedure becomes more statistically 
efficient as the number of periods increases. For both 
characteristics, the results reveal that the best linear 
unbiased procedure based on 12 periods is uniformly more 
efficient than the CPS composite estimator for all 
parameters, except one-period change in unemployed. 
Recall that the estimator of change is not BLUE because the 
estimator is the difference of estimators constructed at time 
t and at time t - 1. Thus, the estimator called “BLUE” is 
best only for current level using the stated amount of data. 
The difference between the variance of the composite 
estimator of one-period change and the variance of the 
12-period BLUE of one-period change in unemployed is 
less than one percent. The gain in precision of the best 
linear unbiased estimation procedure for employed relative 
to the CPS composite estimator for current level is 22% for 
the BLUE for 12 periods, 28% for the BLUE for 16 
periods, 30% for the BLUE for 24 periods, and 33% for the 
RRE. The corresponding gains for unemployed are 2%, 
3%, and 3%. These results are a reflection of the nature of 
the autocorrelation functions of the characteristics. The 
autocorrelation function for unemployed declines much 
faster than that for employed. 
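
The quoted gains for employed can be reproduced from the current-level row of Table 1, assuming that "gain in precision" means the ratio of the CPS composite variance to the alternative variance, minus one. That definition is inferred from the numbers and is not stated explicitly in the text. The 24-period BLUE is not tabulated, so its 30% figure cannot be checked this way.

```python
def gain_percent(var_reference: float, var_alternative: float) -> float:
    """Percentage gain in precision of an alternative estimator
    relative to a reference estimator (assumed definition)."""
    return 100.0 * (var_reference / var_alternative - 1.0)


# Current-level relative variances for Employed, read from Table 1.
cps_composite = 0.862
alternatives = {"BLUE 12": 0.704, "BLUE 16": 0.672, "RRE": 0.650}

gains = {name: round(gain_percent(cps_composite, v))
         for name, v in alternatives.items()}
```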

With the exception of one-period change in unemployed, 
there is an improvement in the efficiency of the estimation 
of change from using the alternative estimators instead of 
the CPS composite estimator. The gain in precision 
increases as the number of periods increases, reaching a 
maximum value at five-period change for both characteristics. 
The gain then decreases slightly. In the case of 
the RRE, the maximum gain in efficiency for estimated 
change is 64% for employed and 5% for unemployed. 
Table 1 
Variances of alternative estimators relative to the variance of the basic estimator of current level 

                    |                  Employed                     |                 Unemployed
Parameter           | CPS     BLUE for    BLUE for    RRE           | CPS     BLUE for     BLUE for     RRE
                    | Comp.   12 periods  16 periods                | Comp.   12 periods   16 periods
(1)                 | (2)     (3)         (4)         (5)           | (6)     (7)          (8)          (9)
Current level       | 0.862   0.704       0.672       0.650         | 0.947   0.924        0.918        0.918
1-period change     | 0.511   0.457       0.437       0.432         | 1.070   1.077        1.073        1.073
2-period change     | 0.813   0.646       0.613       0.604         | 1.361   1.345        1.338        1.338
3-period change     | 1.065   0.763       0.724       0.711         | 1.528   1.481        1.473        1.473
4-period change     | 1.279   0.830       0.800       0.784         | 1.645   1.569        1.563        1.562
5-period change     | 1.363   0.880       0.847       0.829         | 1.691   1.614        1.607        1.606
6-period change     | 1.390   0.910       0.873       0.855         | 1.708   1.637        1.628        1.628
7-period change     | 1.388   0.930       0.884       0.865         | 1.710   1.646        1.637        1.636
8-period change     | 1.353   0.932       0.884       0.860         | 1.701   1.645        1.635        1.634
9-period change     | 1.255   0.912       0.854       0.832         | 1.671   1.624        1.614        1.614
10-period change    | 1.154   0.895       0.824       0.806         | 1.641   1.606        [illegible]  1.595
11-period change    | 1.061   0.883       0.795       0.782         | 1.614   1.590        1.578        1.578
12-period change    | 0.992   0.883       0.767       0.761         | 1.593   [illegible]  1.563        1.563


6.3.2 Comparison of Alternative Estimators and 
Rotation Designs 


The variances of alternative estimators under various 
rotation designs are given in Table 2. All variances are relative 
to the variance of the basic estimator of current level under that 
design. The efficiencies of alternative estimators of current 
level, change in level, and average level for multiple time 
periods are compared under the intermittent 4-8-4 rotation 
design and two continuous rotation designs. The continuous 
rotation designs are the 6-continuous scheme and the 8- 
continuous scheme. The 6-continuous scheme is the rotation 
scheme used in the Canadian Labor Force Survey conducted 
by Statistics Canada. For each period of the survey, the sample 
consists of six rotation groups, one rotation group in its first 
time-in-sample, ..., and one rotation group in its sixth time-in- 
sample. A given rotation group remains in the sample for six 
consecutive periods and then permanently drops out of the 
sample. See Kumar and Lee (1983) for more details about the 
design of the Canadian Labor Force Survey. In the 
8-continuous scheme, there are 8 rotation groups in the sample 
for each period. A given rotation group remains in the sample 
for eight consecutive periods and then permanently drops out 
of the sample. 


We compare the performance under the various rotation 
designs using the BLUE of current level based on 36 periods. 
We call this estimator the “best estimator” because its 
efficiency is virtually the same as that of the RRE. For all 
rotation schemes under consideration, there are some 
improvements in the precision of the estimators of current 
level from using the best estimator relative to the CPS 
composite estimator. As seen in Table 2, the gain is highest for 
employed where, under the 4-8-4 rotation scheme, the 
variance of the best estimator of current level is only 76% of 
that of the CPS composite estimator. 

The precision of the estimators of change relative to the 
precision of the CPS composite estimator depends on the 
rotation design. From Table 2, we see that under the 4-8-4 
rotation scheme, there is some gain in precision, which 
increases as the lag increases. For employed, the variance 
of the least squares estimator is 85% of the variance of the 
CPS composite estimator for one-period change, 61% of the 
variance of the CPS composite estimator for six-period 
change, and 76% of the variance of the CPS composite 
estimator for 12-period change. (Compare columns (2) and 
(3) of Table 2.) 
Table 2 
Variances of alternative estimators and rotation designs; the variance of the basic estimator of current level under each design equals one 

                             |                  Employed                     |                 Unemployed
Parameter                    | CPS Comp.  Best Est.  Best Est.  Best Est.    | CPS Comp.  Best Est.  Best Est.    Best Est.
                             | 4-8-4      4-8-4      8-Cont.    6-Cont.      | 4-8-4      4-8-4      8-Cont.      6-Cont.
(1)                          | (2)        (3)        (4)        (5)          | (6)        (7)        (8)          (9)
Current level                | 0.862      0.653      0.761      0.759        | 0.947      0.918      0.944        0.938
1-period change              | 0.511      0.432      0.395      0.434        | 1.070      1.073      1.003        1.051
2-period change              | 0.813      0.604      0.559      0.619        | 1.361      1.338      1.250        [illegible]
3-period change              | 1.065      0.710      0.669      0.747        | 1.528      1.473      1.372        1.443
4-period change              | 1.279      0.783      0.731      0.829        | 1.645      1.562      1.473        1.543
5-period change              | 1.363      0.828      0.782      0.901        | 1.691      1.606      [illegible]  1.607
6-period change              | 1.390      0.854      0.828      0.970        | 1.708      1.628      [illegible]  1.655
7-period change              | 1.388      0.863      0.874      1.026        | 1.710      1.636      1.612        1.686
8-period change              | 1.353      0.858      0.828      0.960        | 1.701      1.934      1.642        1.705
9-period change              | 1.255      0.830      0.960      1.108        | 1.671      1.614      1.663        1.719
10-period change             | 1.154      0.803      0.993      1.139        | 1.641      1.595      1.678        [illegible]
11-period change             | 1.061      0.779      1.021      1.165        | 1.614      1.578      1.688        1.733
12-period change             | 0.992      0.758      1.046      1.186        | 1.593      1.564      1.696        [illegible]
12-period average            | 0.369      0.326      0.440      0.394        | 0.255      0.249      0.301        0.266
Change in 12-period averages | 0.248      0.162      0.365      0.403        | 0.273      0.262      0.372        0.359


For estimating 12-period averages in employed using the 
4-8-4 design, the CPS composite estimator is about 13% 
less efficient than the least squares estimator and, for 
estimating change in 12-period averages, it is about 53% 
less efficient, as can be seen by comparing the second and 
third columns of Table 2. For unemployed and the 4-8-4 
design, there are only modest gains in precision from using 
the least squares estimator relative to the CPS composite 
estimator, as shown in the sixth and seventh columns of 
Table 2. 

For estimation of 12-period change, 12-period average 
and change in 12-period averages, the 4-8-4 design is much 
superior to both continuous rotation designs for both 
characteristics. The continuous designs are generally 
superior for period-to-period changes for short periods. 


6.3.3 Internal Consistency 


In our analysis, we have constructed the best estimator of 
employed using only the past history of employed and the 
best estimator of unemployed using only the past history of 
unemployed. There is no formal reason not to include the 
past history of both employed and unemployed in the 
construction of the estimators. However, Fuller et al. (1993) 
state that the estimated cross correlations are less than 0.10, 
suggesting that there is little gain from such inclusion. 

A method of constructing estimates of multiple 
characteristics that are internally consistent was suggested 
by Fuller (1990). In this procedure, estimates of employed, 
unemployed, and not-in-the-labor-force are constructed. 
Then these estimates are used as controls in a regression 
procedure to construct weights for the current observations. 
The weights can then be used to construct internally 
consistent estimates of any parameter of interest. The 
estimation procedure, including estimates of subdivisions 
of the labor force, is planned for implementation in 1998 for 
the CPS. See Lent, Miller and Cantwell (1996). 


6.4 Conclusions 


The main conclusions emerging from the variance 
computations in this section can be summarized as follows: 
1. For all rotation designs and all characteristics under 
consideration, there are alternative estimation procedures 
with a variance of the current level smaller than 
that of the CPS composite estimator. 

2. For estimation of change under the 4-8-4 rotation 
design, the gain in precision of the alternative estimators 
relative to the CPS composite estimator increases as the 
lag increases, and peaks around the lag of minimum 
overlap. 

3. The intermittent 4-8-4 rotation design is inferior to the 
continuous rotation designs for short-period changes, 
but superior for current level, long-period averages, and 
changes in long-period averages. 

4. The CPS composite estimator is comparable to the RRE 
for unemployed for the estimation of one-period change 
and 12-period change. However, the recursive regres- 
sion estimation procedure is superior to the CPS 
composite estimator for other measures of change. 

5. The RRE is more efficient in estimating change in level 
at lags for which the CPS composite estimator is not 
targeted, for instance, lags of four months to six months. 
APPENDIX 


Lemma 1. Let the assumptions of the theorem hold. 
Then the variance of the estimator of current level $\theta_t$ 
converges to a positive number as the number of periods 
increases. 


Proof. If the means $\theta_{t-1}, \theta_{t-2}, \ldots$ were known, 
then $g_{i,t}$, $i = 1, 2, \ldots, s$, would be unbiased estimators of $\theta_t$, where 
$g_{1,t} = y_{1,t}$; $g_{2,t} = y_{2,t} - b_{21}(y_{2,t-1} - \theta_{t-1})$; and 
$g_{i,t} = y_{i,t} - \sum_{j} b_{ij}(y_{i,t-j} - \theta_{t-j})$ for the remaining streams. 
These estimators are independent with variances $\sigma_i^2$, $i = 1, 2, \ldots, s$. We may 
write the linear model: 

$$g = J_s \theta_t + e, \tag{A1}$$

where $g = (g_{1,t}, g_{2,t}, \ldots, g_{s,t})'$, $J_s$ is the $s \times 1$ column 
of ones, and $e$ is the $s \times 1$ vector of errors with $E(e) = 0$ 
and $E(ee') = V_e = \mathrm{Diag}\{\sigma_1^2, \sigma_2^2, \ldots, \sigma_s^2\}$. Thus the BLUE of 
$\theta_t$ for model (A1) has variance $(\sum_{i=1}^{s} \sigma_i^{-2})^{-1}$. By assumption, 
the variances $\sigma_i^2$, $i = 1, 2, \ldots, s$, are bounded below and 
the quantity $(\sum_{i=1}^{s} \sigma_i^{-2})^{-1}$ is a positive lower bound for the 
variance of the estimator of $\theta_t$ [see Lemma 4.2.3 of 
Yansaneh (1992)]. The variance of the estimator of $\theta_t$ is 
non-increasing as the number of observations increases, and 
hence, the variance converges to a positive number. 
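
As a small illustration of the variance expression used in this proof: for independent unbiased estimators of a common mean with known variances, the BLUE weights each estimator by its inverse variance and has variance equal to the inverse of the summed precisions. The numbers and the function name `blue_common_mean` are hypothetical.

```python
import numpy as np


def blue_common_mean(g, variances):
    """BLUE of a common mean from independent unbiased estimators g_i
    with variances sigma_i^2: inverse-variance weighting."""
    g = np.asarray(g, dtype=float)
    precision = 1.0 / np.asarray(variances, dtype=float)
    weights = precision / precision.sum()
    estimate = float(weights @ g)
    variance = float(1.0 / precision.sum())   # (sum_i sigma_i^{-2})^{-1}
    return estimate, variance, weights


# Hypothetical stream estimators and their variances.
est, var, w = blue_common_mean([10.0, 12.0, 11.0], [1.0, 2.0, 4.0])
```

The resulting variance is strictly positive and never larger than the smallest individual variance, which is why a positive lower bound on the individual variances carries over to the combined estimator.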

Lemma 2. Let the assumptions of the theorem hold. 
Then the variance of the least squares estimator of each of 
the parameters $\theta_{t-m}, \theta_{t-m+1}, \ldots, \theta_{t-1}$, based on data through 
time t, converges to a positive number as t increases. 

Proof. First, suppose at a fixed time Tt, at least m 
periods of observations are available both prior to t and 
after t. Define a transformation of the following form for 
the observations in each of the s streams at time T: 

m . 
Ui, ai. Tn i= Ong, T), tea C= yb where bx.2),0 = 0 and Ui, = 
uncorrelated with all observations preceding and 
a oan y,, in the i-th stream. Let the variance of wu, 
be Me = 1,2, ...,s. These variances are bounded below by 
eet We Conchide! as before, that there is a positive 
lower bound for the diagonal elements of the covariance 
matrix of the vector of recursive least squares estimators. 

Now, assume that at time t, we begin the sequence of 
estimation with the vector of recursive least squares 
estimators $\hat\theta_{t-1,(m)} = (\hat\theta_{t-m}, \ldots, \hat\theta_{t-1})'$ based on data for the 
preceding m periods, and the vector of transformed 
observations $Z_t = (z_{1,t}, \ldots, z_{s,t})'$. Thus the linear model for 
the data at time t is given by (7), with c replaced by t. The 
data vector $Z_t$ is of fixed dimension. Therefore, the 
covariance matrix of the BLUE of the vector of parameters 
$\theta_{t,(m+1)} = (\theta_{t-m}, \ldots, \theta_{t-1}, \theta_t)'$ is $\Sigma_{t,(m+1)} = (W' V^{-1} W)^{-1}$. For 
computational convenience, we express W as $(I_{m+1}, M')'$, 
where $I_{m+1}$ is the identity matrix of order m + 1, and M is 
an $(s-1) \times (m+1)$ matrix which is constant over time. 
Thus we have 

$$\Sigma_{t,(m+1)} = \left(\Omega_{t-1,(m+1)}^{-1} + M' \Omega_0^{-1} M\right)^{-1} = \Omega_{t-1,(m+1)} - \Omega_{t-1,(m+1)} M' D_t^{-1} M \Omega_{t-1,(m+1)}, \tag{A2}$$

where $\Omega_{t-1,(m+1)} = \mathrm{blockdiag}\{\Sigma_{t-1,(m)}, \sigma_1^2\}$, $\Omega_0 = \mathrm{diag}\{\sigma_2^2, \ldots, \sigma_s^2\}$, 
and $D_t = \Omega_0 + M \Omega_{t-1,(m+1)} M'$. Since the second term on 
the right hand side of (A2) is positive semidefinite, we conclude 
that the first m diagonal elements of $\Sigma_{t,(m+1)}$ are less than or 
equal to the original diagonal elements of $\Sigma_{t-1,(m)}$. This 
means that as t increases, the variances of the estimators of 
$\theta_{t-m}, \ldots, \theta_{t-1}$ are non-increasing. Since these variances 
are bounded below by a positive quantity, we conclude that 
the variances of the estimators of $\theta_{t-m}, \ldots, \theta_{t-1}$ 
converge to positive numbers as t increases. 

Lemma 3. Let the assumptions of the theorem hold. 
Then, the variance of the least squares estimator of each of 
the parameters $\theta_t - \theta_{t-1}, \theta_t - \theta_{t-2}, \ldots, \theta_t - \theta_{t-m}$, based on data 
through time t, converges to a positive number as t 
increases. 

Proof. First, we show that the variance of the least squares 
estimator of $\theta_t - \theta_{t-1}$ (where t denotes the current period) 
converges as the number of periods increases by mimicking 
the arguments in the proof of Lemma 1. Also, arguments 
similar to those in the proof of Lemma 2 can be used to 
show that the variances of the least squares estimators of the 
parameters $\theta_t - \theta_{t-2}, \ldots, \theta_t - \theta_{t-m}$ all converge as 
the number of periods increases. 

Proof of theorem. Since $\Sigma_{t,(m)}$ is a submatrix of the 
covariance matrix $\Sigma_{t,(m+1)}$ of the least squares estimators of 
the full set of parameters $\theta_{t-m}, \theta_{t-m+1}, \ldots, \theta_{t-1}, \theta_t$ at time t, 
it is enough to show that $\Sigma_{t,(m+1)}$ converges to a positive 
definite matrix as $t \to \infty$. From Lemma 1 and Lemma 2, 
each of the diagonal elements of $\Sigma_{t,(m+1)}$ converges to a 
positive number as $t \to \infty$. From Lemma 3, the variance of 
the least squares estimator of each of the parameters 
$\theta_t - \theta_{t-1}, \ldots, \theta_t - \theta_{t-m}$ converges to a positive 
number as $t \to \infty$. It follows that for each j, $1 \le j \le m$, the 
covariance between the least squares estimators of $\theta_t$ and $\theta_{t-j}$ 
converges as $t \to \infty$ and hence the covariance matrix $\Sigma_{t,(m+1)}$ 
converges as $t \to \infty$. 

Next, we prove that the limiting covariance matrix is 
positive definite. Let $\lim_{t \to \infty} \Sigma_{t,(m+1)} = \Sigma_{(m+1)}$. It is enough to 
show that the variance of any non-trivial linear combination 
of the recursive least squares estimators $\hat\theta_{t-j}(t)$, 
$j = 0, 1, \ldots, m$, is bounded below by a positive quantity. Let 
$v_m$ be the lower bound of the variance of every linear combination of the 
observations with one of the coefficients equal to one. The 
bound is positive by the assumption that the elements of 
$V^{-1}$ are bounded. 

Now, every estimator of the parameter $\theta_{t-j}$, 
$j = 0, 1, \ldots, m$, is a linear combination of all observations 
such that the sum of the coefficients for the observations in 
the s streams at time t - j is one, and the sum of the 
coefficients for the observations in the s streams at any 
other time is zero. This is a condition for the unbiasedness 
of the estimator for time t - j. For the sum of the 
coefficients of the s observations at time t - j to be equal to 
one, at least one of the coefficients must be greater than or 
equal to $s^{-1}$. The minimum variance of any linear combination 
with first coefficient equal to $s^{-1}$ is $s^{-2} v_m$. 
Therefore, for $j = 0, 1, \ldots, m$, $\mathrm{Var}\{\hat\theta_{t-j}(t)\} \ge s^{-2} v_m$. 

Now, consider an arbitrary, non-trivial linear 
combination of the recursive least squares estimators 
$\hat\theta_{t-j}(t)$, $j = 0, 1, \ldots, m$, given by $\sum_{j=0}^{m} \lambda_j \hat\theta_{t-j}(t)$, where, 
without loss of generality, $\lambda_0 = 1$. This linear combination 
can be expressed as 

$$\sum_{j=0}^{m} \lambda_j \hat\theta_{t-j}(t) = \hat\theta_t(t) + \sum_{j=1}^{m} \lambda_j \hat\theta_{t-j}(t) = \sum_{i=1}^{s} \Big( c_i + \sum_{j=1}^{m} \lambda_j f_{i,(t-j)} \Big) y_{i,t} + \text{(terms in observations at times other than } t\text{)}, \tag{A3}$$

where $c_i$, $i = 1, 2, \ldots, s$, are the coefficients of $y_{i,t}$ in $\hat\theta_t(t)$, 
and $f_{i,(t-j)}$, $j = 1, \ldots, m$, are the coefficients of $y_{i,t}$ in 
$\hat\theta_{t-j}(t)$, respectively. Therefore, $\sum_{i=1}^{s} c_i = 1$, 
and $\sum_{i=1}^{s} f_{i,(t-j)} = 0$ for $j = 1, \ldots, m$. Thus 
$\sum_{i=1}^{s} ( c_i + \sum_{j=1}^{m} \lambda_j f_{i,(t-j)} ) = 1$. That is, in the linear combination (A3), 
the sum of the coefficients for the observations $y_{i,t}$, 
$i = 1, 2, \ldots, s$, at time t is one. Therefore, at least one of the 
coefficients is greater than or equal to $s^{-1}$. Hence, 
$\mathrm{Var}\{\sum_{j=0}^{m} \lambda_j \hat\theta_{t-j}(t)\} \ge s^{-2} v_m$, and we conclude that $\Sigma_{(m+1)}$ is 
positive definite. 
Estimation of Variance of General Regression Estimator: 
Higher Level Calibration Approach 


SARJINDER SINGH, STEPHEN HORN and FRANK YU 


ABSTRACT 


In the present investigation, the problem of estimation of variance of the general linear regression estimator has been 
considered. It has been shown that the efficiency of the low level calibration approach adopted by Sarndal (1996) is less 
than or equal to that of a class of estimators proposed by Deng and Wu (1987). A higher level calibration approach has also 
been suggested. The efficiency of higher level calibration approach is shown to improve on the original approach. Several 
estimators are shown to be special cases of the proposed higher level calibration approach. An idea to find a non-negative 
estimate of variance of the GREG has been suggested. Results have been extended to a stratified random sampling 
design. An empirical study has also been carried out to study the performance of the proposed strategies. The well known 
statistical package, GES, developed at Statistics Canada can further be improved to obtain better estimates of variance of 
GREG using the proposed higher level calibration approach under certain circumstances discussed in this paper. 


KEY WORDS: Calibration; Estimation of variance; Auxiliary information; Ratio and regression type estimators; Model 
assisted approach. 


1. INTRODUCTION 


Statisticians are often interested in the precision of 
survey estimates. The most commonly used estimator of 
population total/mean is the generalized linear regression 
(GREG) estimator. Let us consider the simplest case of 
the GREG where information on only one auxiliary variable 
is available. Consider a population $\Omega = \{1, 2, \ldots, N\}$, from 
which a probability sample $s$ ($s \subset \Omega$) is drawn with a given 
sampling design, $p(\cdot)$. The inclusion probabilities $\pi_i = \Pr(i \in s)$ 
and $\pi_{ij} = \Pr(i \text{ and } j \in s)$ are assumed to be strictly 
positive and known. Let $y_i$ be the value of the variable of 
interest, y, for the i-th population element, with which also 
is associated an auxiliary variable $x_i$. For the elements 
$i \in s$, we observe $(y_i, x_i)$. The population total of the 
auxiliary variable x, $X = \sum_{i=1}^{N} x_i$, is assumed to be 
accurately known. The objective is to estimate the 
population total $Y = \sum_{i=1}^{N} y_i$. Deville and Sarndal (1992) 
used calibration on known population x-total to modify the 
basic sampling design weights, $d_i = 1/\pi_i$, that appear in the 
Horvitz-Thompson (1952) estimator 

$$\hat{Y}_{HT} = \sum_{i=1}^{n} \frac{y_i}{\pi_i} = \sum_{i=1}^{n} d_i y_i. \tag{1.1}$$

A new estimator 

$$\hat{Y}_{DS} = \sum_{i=1}^{n} w_i y_i \tag{1.2}$$

was proposed by Deville and Sarndal (1992), with weights $w_i$ 
as close as possible in an average sense for a given metric 
to the $d_i$, while respecting the calibration equation 

$$\sum_{i=1}^{n} w_i x_i = X. \tag{1.3}$$

A simple case considered by Deville and Sarndal (1992) is 
the minimization of the chi-square type distance function given 
by 

$$\sum_{i=1}^{n} \frac{(w_i - d_i)^2}{d_i q_i}, \tag{1.4}$$

where $q_i$ are suitably chosen weights. In most 
situations, $q_i = 1$. The form of the estimator 
depends upon the choice of $q_i$. By minimizing (1.4) subject 
to calibration equation (1.3) we obtain weights 


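$$w_i = d_i + \frac{d_i q_i x_i}{\sum_{i=1}^{n} d_i q_i x_i^2} \left( X - \sum_{i=1}^{n} d_i x_i \right). \tag{1.5}$$

Substitution of the value of $w_i$ from (1.5) in (1.2) leads to 
the traditional regression estimator of total given by 

$$\hat{Y}_{DS} = \sum_{i=1}^{n} d_i y_i + \hat\beta \left( X - \sum_{i=1}^{n} d_i x_i \right), \tag{1.6}$$

where $\hat\beta = \sum_{i=1}^{n} d_i q_i x_i y_i \big/ \sum_{i=1}^{n} d_i q_i x_i^2$. 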
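
The chi-square calibration step can be sketched numerically as follows. The sample values, the population total `X_total` and the function names are hypothetical and serve only to illustrate (1.3), (1.5) and (1.6); equal design weights are assumed for simplicity.

```python
import numpy as np


def calibration_weights(d, x, X_total, q=None):
    """Weights minimizing sum (w_i - d_i)^2 / (d_i q_i) subject to
    sum w_i x_i = X_total, as in (1.5)."""
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    return d + d * q * x * (X_total - np.sum(d * x)) / np.sum(d * q * x ** 2)


def regression_total(d, x, y, X_total, q=None):
    """Regression estimator of the total in the form (1.6)."""
    d, x, y = (np.asarray(v, dtype=float) for v in (d, x, y))
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    beta = np.sum(d * q * x * y) / np.sum(d * q * x ** 2)
    return float(np.sum(d * y) + beta * (X_total - np.sum(d * x)))


# Hypothetical sample of n = 4 units from N = 20 with equal design weights.
x = np.array([2.0, 4.0, 5.0, 9.0])
y = np.array([5.0, 9.0, 11.0, 20.0])
d = np.full(4, 20 / 4)
X_total = 110.0

w = calibration_weights(d, x, X_total)               # q_i = 1
w_ratio = calibration_weights(d, x, X_total, q=1.0 / x)  # q_i = 1/x_i
```

With these weights the calibration equation holds exactly, the weighted total of y equals (1.6), and the choice q_i = 1/x_i rescales every design weight by X divided by the Horvitz-Thompson estimate of X, which gives the ratio estimator.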


In this paper, the problem of estimation of variance of the 
regression estimator (1.6) has been considered at two 
different levels of calibration. The higher level calibration 
approach covers a greater variety of estimators than the low 
level calibration approach adopted by Sarndal (1996). 
The higher level calibration approach makes use of the known total 
as well as the known variance of the auxiliary character, 
whereas low level calibration utilizes only the known total of 
the auxiliary character. 

Section 4 is devoted to the stratified 
sampling design. The original stratum weights are calibrated, 
which results in combined regression and combined 
ratio estimators in stratified sampling. The estimators of 
variance of the combined regression and combined ratio estimators 
proposed by Wu (1985) are shown to be special 
cases of the low level calibration approach. The higher level 
calibration approach has been shown to apply to a broader 
variety of estimators. 


2. ESTIMATOR OF VARIANCE OF THE GREG: 
THE LOW LEVEL CALIBRATION 
APPROACH 


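Following the model assisted survey sampling approach of 
Sarndal, Swensson and Wretman (1989, 1992), the Yates-Grundy 
(1953) form of estimator of variance of the 
estimator (1.6) is given by 

$$\hat{V}_{YG}(\hat{Y}_{DS}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (w_i e_i - w_j e_j)^2, \tag{2.1}$$

where $D_{ij} = (\pi_i \pi_j - \pi_{ij})/\pi_{ij}$, $i \ne j$, and $e_i = y_i - \hat\beta x_i$ have their 
usual meanings. This estimator can easily be written as 

$$\hat{V}_{YG}(\hat{Y}_{DS}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (d_i e_i - d_j e_j)^2 + \hat\psi_1 \left[ X - \sum_{i=1}^{n} d_i x_i \right] + \hat\psi_2 \left[ X - \sum_{i=1}^{n} d_i x_i \right]^2, \tag{2.2}$$

where 

$$\hat\psi_1 = \left( \sum_{i=1}^{n} d_i q_i x_i^2 \right)^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (d_i e_i - d_j e_j)(d_i q_i x_i e_i - d_j q_j x_j e_j), \tag{2.3}$$

$$\hat\psi_2 = \frac{1}{2} \left( \sum_{i=1}^{n} d_i q_i x_i^2 \right)^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (d_i q_i x_i e_i - d_j q_j x_j e_j)^2. \tag{2.4}$$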


The estimator at (2.1) has been discussed by Sarndal et al. 
(1989, 1992, 1996) on different occasions and covers a 
variety of estimators as discussed below: 


For simplicity, let us consider the simple random sampling 
without replacement (SRSWOR) design, i.e., $\pi_i = \pi_j = n/N$ 
and $\pi_{ij} = n(n-1)/\{N(N-1)\}$. Then we have the following 
cases: 


Case 2.1: If $q_i = 1$, then (1.6) reduces to the usual regression 
estimator of total, $\hat{Y}_{GREG}$ (say). Now if $w_i = d_i$ in (2.1), 
it reduces to 

$$\hat{V}_0(\hat{Y}_{GREG}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2, \tag{2.5}$$

where $f = n/N$ and $e_i = y_i - \hat\beta x_i$. Thus (2.5) denotes the 
usual estimator of variance of the regression estimator (1.6). 
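
The reduction of the Yates-Grundy form under SRSWOR can be checked numerically. In the sketch below, which uses hypothetical residuals and helper names introduced here, the double sum with $w_i = d_i = N/n$ equals $N^2(1-f)/\{n(n-1)\}$ times the sum of squared deviations of the residuals from their mean; this coincides with the sum of squared residuals whenever the residuals sum to zero.

```python
import numpy as np


def srswor_yg_weights(n: int, N: int) -> np.ndarray:
    """D_ij = (pi_i pi_j - pi_ij) / pi_ij under SRSWOR, zero on the diagonal."""
    pi_i = n / N
    pi_ij = n * (n - 1) / (N * (N - 1))
    D = np.full((n, n), (pi_i * pi_i - pi_ij) / pi_ij)
    np.fill_diagonal(D, 0.0)
    return D


def yates_grundy(u: np.ndarray, D: np.ndarray) -> float:
    """0.5 * sum_i sum_j D_ij (u_i - u_j)^2 with u_i = w_i e_i."""
    diff = u[:, None] - u[None, :]
    return float(0.5 * np.sum(D * diff ** 2))


n, N = 4, 20
e = np.array([0.7, -1.2, 0.4, 0.9])      # hypothetical residuals
d = np.full(n, N / n)                    # design weights under SRSWOR

v_yg = yates_grundy(d * e, srswor_yg_weights(n, N))
v_closed = N ** 2 * (1 - n / N) / (n * (n - 1)) * np.sum((e - e.mean()) ** 2)
```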
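Case 2.2: If $q_i = 1/x_i$, then the estimator (1.6) reduces to the 
ratio estimator of total, $\hat{Y}_{ratio}$ (say). The estimator (2.1) 
reduces to an estimator of variance of the estimator 
$\hat{Y}_{ratio}$, given by 

$$\hat{V}(\hat{Y}_{ratio}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2 \left[ \frac{\bar{X}}{\bar{x}} \right]^2, \tag{2.6}$$

where $e_i = y_i - (\bar{y}/\bar{x}) x_i$, $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$ and $\bar{X} = X/N$. 

The estimator at (2.6) is a special case of a class of estimators 
of variance of the ratio estimator proposed by Wu 
(1982) as 

$$\hat{V}_g(\hat{Y}_{ratio}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2 \left[ \frac{\bar{X}}{\bar{x}} \right]^g \tag{2.7}$$

for g = 2. 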


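Case 2.3: If $q_i = 1$ and $w_i$ is given by (1.5), then (2.2) and 
hence (2.1) becomes 

$$\hat{V}_{YG}(\hat{Y}_{GREG}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2 + \hat\psi_1 (X - \hat{X}_{HT}) + \hat\psi_2 (X - \hat{X}_{HT})^2, \tag{2.8}$$

where $\hat{X}_{HT} = \sum_{i=1}^{n} d_i x_i$, 

$$\hat\psi_1 = \frac{N - n}{n(n-1) \sum_{i=1}^{n} x_i^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (e_i - e_j)(x_i e_i - x_j e_j) \tag{2.9}$$

and 

$$\hat\psi_2 = \frac{N - n}{2N(n-1) \left( \sum_{i=1}^{n} x_i^2 \right)^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i e_i - x_j e_j)^2. \tag{2.10}$$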
Deng and Wu (1987) have defined a general class of 
estimators of the variance of the regression estimator as 

$$\hat{V}_g(\hat{Y}_{GREG}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2 \left[ \frac{\bar{X}}{\bar{x}} \right]^g, \tag{2.11}$$

where $e_i = y_i - \hat\beta x_i$. The linear form of the class of 
estimators (2.11) is 


Delia) 5 
nC) Due 


2 
oie ] ees } =| (2.12) 
34 2 x 


which is again similar to (2.8). Thus the low level calibration 
approach considers estimators of variance of estimators 
of total, i.e., both ratio and regression methods of estimation. 
It is notable that there is no choice of $q_i$ which reduces 
(1.6) to the product method of estimation considered by 
Cochran (1963). Hence the estimation of variance of the 
product estimator has not been considered. To look at the 
efficiency of such estimators, we consider an analogue of 
the general class of estimators for estimating variance of 
GREG by following Srivastava (1971) as 

$$\hat{V}_H(\hat{Y}_{GREG}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2 \, H\!\left( \frac{\bar{x}}{\bar{X}} \right), \tag{2.13}$$

where $H(\cdot)$ is a parametric function such that $H(1) = 1$ 
and satisfies certain regularity conditions. Following 
Srivastava (1971), it is easy to see that analogues of the 
general class of estimators (2.13) attain the minimum 
variance of the class of estimators proposed by Deng and 
Wu (1987) for the regression estimator and by Wu (1982) for the ratio 
estimator. That is, if we attach any 
function of the ratio $\bar{x}/\bar{X}$ to the usual estimator of variance 
given by 

$$\hat{V}_0(\hat{Y}_{GREG}) = \frac{N^2 (1 - f)}{n(n-1)} \sum_{i=1}^{n} e_i^2,$$

the asymptotic variance of the resultant estimator remains 
the same. In other words, the efficiency of the estimators of 
variance of the regression estimator (GREG) of total obtained 
through low level calibration remains less than or equal to 
that of the class of estimators proposed by Wu (1982) and Deng 
and Wu (1987). The weights $w_i$ used to construct the estimator 
of variance of GREG at (2.1) were obtained while estimating 
the population total and are hence named low level 
calibration weights for variance estimation. The next 
section is devoted to the higher level calibration approach 
where the variance of the auxiliary character is known. Several 
new estimators are shown as special cases of the proposed 
higher level calibration approach. 


3. IMPROVED ESTIMATOR OF VARIANCE OF 
THE GREG: THE HIGHER LEVEL 
CALIBRATION APPROACH 


Here we apply the calibration approach to estimate the 
variance of the GREG estimator at (1.6). The weights $D_{ij}$ of the 
Yates and Grundy (1953) form of the estimator of variance given 
at (2.1) are calibrated such that the estimator of variance for 
the auxiliary variable equals its exact variance. We consider 
an estimator of variance of GREG 

$$\hat{V}_c(\hat{Y}_{DS}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \Omega_{ij} (w_i e_i - w_j e_j)^2, \tag{3.1}$$

where $\Omega_{ij}$ are the modified weights attached to the 
quadratic expression in the Yates and Grundy (1953) form of 
estimator and are as close as possible in an average sense 
for a given measure to the $D_{ij}$, subject to the calibration 
equation 

$$\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \Omega_{ij} (d_i x_i - d_j x_j)^2 = V_{YG}(\hat{X}_{HT}), \tag{3.2}$$

where 

$$V_{YG}(\hat{X}_{HT}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\pi_i \pi_j - \pi_{ij})(d_i x_i - d_j x_j)^2$$

denotes the known variance of the estimator of the auxiliary 
total $X$ ($= \sum_{i=1}^{N} x_i$) given by $\hat{X}_{HT} = \sum_{i=1}^{n} d_i x_i$. To compute 
the right hand side of (3.2) we need either information on 
every unit of the auxiliary character in the population, or 
only $V_{YG}(\hat{X}_{HT})$ obtained from a past survey or pilot survey. 
Examples of situations where information on every 
unit of the auxiliary character is known are the establishment 
turnover recorded from census or administrative 
records, the Business Register (BR) or the Australian Taxation 
Office (ATO). Known variance of the auxiliary character 
has also been used by Das and Tripathi (1978), Singh and 
Srivastava (1980), Srivastava and Jhajj (1980, 1981), Isaki 
(1983), Singh and Singh (1988), Swain and Mishra 
(1992), Shah and Patel (1996) and Garcia and Cebrian 
(1996). Singh, Mangat and Mahajan (1995) have reviewed 
classes of estimators of unknown population parameters 
making use of the known variance of an auxiliary character. 
The idea of adjusting the $D_{ij}$ weights has also been discussed 
by Fuller (1970) through a regression type estimation 
procedure. For simplicity we restrict ourselves to the two 
dimensional chi-square (CS) type distance, D, between 
two $n \times n$ grids formed by the weights $\Omega_{ij}$ and $D_{ij}$, for 
$i, j = 1, 2, \ldots, n$, given by 

$$D = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(\Omega_{ij} - D_{ij})^2}{D_{ij} Q_{ij}}. \tag{3.3}$$


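In most situations $Q_{ij} = 1$, but other types of weights 
can also be used. We will show that the ratio type 
adjustment using the known variance of the auxiliary character is 
a special case for a particular choice of $Q_{ij}$. Minimization 
of (3.3) subject to (3.2) leads to modified optimal weights 
given by 

$$\Omega_{ij} = D_{ij} + \frac{D_{ij} Q_{ij} (d_i x_i - d_j x_j)^2}{\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} Q_{ij} (d_i x_i - d_j x_j)^4} \left[ V_{YG}(\hat{X}_{HT}) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (d_i x_i - d_j x_j)^2 \right] \tag{3.4}$$

for the optimal choice of the Lagrange multiplier $\lambda$, given by 

$$\lambda = \frac{V_{YG}(\hat{X}_{HT}) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (d_i x_i - d_j x_j)^2}{\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} Q_{ij} (d_i x_i - d_j x_j)^4}. \tag{3.5}$$

Its proof is given in the Appendix. Substitution of $\Omega_{ij}$ from 
(3.4) in (3.1) leads to the following regression type 
estimator, 

$$\hat{V}_c(\hat{Y}_{DS}) = \hat{V}_{YG}(\hat{Y}_{DS}) + \hat{B}_1 \left[ V_{YG}(\hat{X}_{HT}) - \hat{V}_{YG}(\hat{X}_{HT}) \right], \tag{3.6}$$

where 

$$\hat{B}_1 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} Q_{ij} (d_i x_i - d_j x_j)^2 (w_i e_i - w_j e_j)^2}{\sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} Q_{ij} (d_i x_i - d_j x_j)^4}, \tag{3.7}$$

$\hat{V}_{YG}(\hat{X}_{HT}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} (d_i x_i - d_j x_j)^2$ and $\hat{V}_{YG}(\hat{Y}_{DS})$ 
is given in (2.1). The regression coefficient $\hat{B}_1$ makes use of 
the known total X of the auxiliary variable and hence can be 
treated as an improved estimator of the regression coefficient 
by following Singh and Singh (1988). Under the higher 
level calibration approach, we have the following cases: 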


Case 3.1: Under the SRSWOR sampling design, if $q_i = 1/x_i$
and $Q_{ij} = (d_i x_i - d_j x_j)^{-2}$ are the weights attached at the low
level and higher level calibration approaches, respectively,
then the proposed strategy reduces to


NOTH) Geen ee Ne 
ar ata eae || 4 3 (3.8) 


where $s_x^2 = (n-1)^{-1}\sum_{i=1}^{n}(x_i - \bar x)^2$ is an unbiased estimator
of $S_x^2 = (N-1)^{-1}\sum_{i=1}^{N}(x_i - \bar X)^2$.

Case 3.2: If $q_i = 1$ and $Q_{ij} = 1 \;\forall\, i, j$, then we have


PSD) ys erPait, (e ae 


alae) n(n-1) jin | 


A 


o,(x- XP +9, (52-5?) G9 


where , and W, are given by (2.9) and (2.10), 
respectively, and 


Dee 
= Cis) 


DDD Care 


i=l j=l 


ae ve eesel 
(x, re x,)(e; if ey) we 


Without loss of generality, the estimators of variance of 
GREG given at (3.8) and (3.9) are neither members of a low 
level calibration approach nor of the class of estimators by 
Deng and Wu (1987). These estimators are members of the 
analogues of classes of estimators for estimating variance 
of GREG given by Srivastava and Jhajj (1981) as 


2; 


se es Nr(1-f)x Lee 
Pal Penae|=| ODS: of ae 


(3.11) 


where H(.,.) is a parametric function such that H(1, 1) = 1 
and which satisfies certain regularity conditions defined by 
them. Following Srivastava and Jhajj (1981) and Deng and
Wu (1987), it is a classroom exercise to see that the class
of estimators at (3.11) remains better than the class of
estimators defined at (2.11) and hence (2.13).

A difficult issue in using (3.1) is how to get non-negative
estimates of variance using calibration. The simplest way is
to optimize the CS distance function (3.3) subject to the
calibration constraint (3.2) along with the conditions
$\Omega_{ij} \ge 0 \;\forall\, i, j = 1, 2, \ldots, n$. While it is difficult to develop a
solution to this problem theoretically, well known quadratic
programming techniques can yield useful numerical results.
A straightforward extension to other distance functions,
as discussed by Deville and Sarndal (1992) for instance, is not
possible for the two dimensional problem because of the indeterminate
nature of the $D_{ij}$ weights. It is open to others
to propose new distance functions which guarantee the 
non-negativity of the weights. 
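
The closed-form weights of (3.4) and the non-negativity issue raised above can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the design is SRSWOR, $D_{ij}$ is taken in the Yates-Grundy form $(\pi_i\pi_j - \pi_{ij})/\pi_{ij}$ (its definition lies outside this passage, so this is an assumption), $Q_{ij} = 1$, and the population values are made up. The function name `higher_level_weights` is hypothetical. The sketch computes the weights, checks that they reproduce the known variance of the auxiliary total, and counts negative weights; it does not implement the quadratic programming step.

```python
import random

def higher_level_weights(x_sample, N, v_known, Q=lambda i, j: 1.0):
    """Closed-form weights Omega_ij as in (3.4) under SRSWOR (assumed design)."""
    n = len(x_sample)
    pi_i = n / N                              # first-order inclusion probability
    pi_ij = n * (n - 1) / (N * (N - 1))       # joint inclusion probability, i != j
    D = (pi_i * pi_i - pi_ij) / pi_ij         # ASSUMED Yates-Grundy form of D_ij
    d = 1.0 / pi_i                            # design weight d_i = 1 / pi_i
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    # squared differences (d_i x_i - d_j x_j)^2 for every ordered pair
    diff2 = {(i, j): (d * x_sample[i] - d * x_sample[j]) ** 2 for i, j in pairs}
    v_hat = 0.5 * sum(D * diff2[p] for p in pairs)            # estimated variance of X_HT
    denom = 0.5 * sum(D * Q(*p) * diff2[p] ** 2 for p in pairs)
    lam = (v_known - v_hat) / denom                           # multiplier as in (3.5)
    omega = {p: D + lam * D * Q(*p) * diff2[p] for p in pairs}
    return omega, diff2

random.seed(1)
population = [random.uniform(10, 50) for _ in range(20)]      # hypothetical x-values
N, n = len(population), 5
sample = random.sample(population, n)
mean_x = sum(population) / N
S2_x = sum((v - mean_x) ** 2 for v in population) / (N - 1)
v_known = N ** 2 * (1 - n / N) * S2_x / n                     # V(X_HT) under SRSWOR

omega, diff2 = higher_level_weights(sample, N, v_known)
calibrated = 0.5 * sum(omega[p] * diff2[p] for p in omega)    # left side of the constraint
negative_count = sum(1 for w in omega.values() if w < 0)
print(calibrated, v_known, negative_count)
```

When `negative_count` is positive, the closed form is no longer adequate and a constrained numerical optimizer would be needed, as the text notes.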


4. STRATIFIED SAMPLING DESIGN 


Suppose the population consists of $L$ strata with $N_h$
units in the $h$-th stratum, from which a simple random
sample of size $n_h$ is taken without replacement. The total
population size is $N = \sum_{h=1}^{L} N_h$ and the sample size is $n = \sum_{h=1}^{L} n_h$.
Associated with the $i$-th unit of the $h$-th stratum there are
two values $y_{hi}$ and $x_{hi}$, with $x_{hi} > 0$ being the covariate. For
the $h$-th stratum, let $W_h = N_h/N$ be the stratum weight,
$f_h = n_h/N_h$ the sampling fraction, and $\bar y_h$, $\bar x_h$, $\bar Y_h$, $\bar X_h$ the $y$-
and $x$- sample and population means, respectively. Assume
$\bar X = \sum_{h=1}^{L} W_h \bar X_h$ is known. The purpose is to estimate
$\bar Y = \sum_{h=1}^{L} W_h \bar Y_h$, possibly by incorporating the covariate
information $x$. The usual estimator of the population mean $\bar Y$ is
given by


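$$\bar y_{st} = \sum_{h=1}^{L} W_h \bar y_h. \qquad (4.1)$$
We are considering a new estimator, given by
$$\bar y_{st}^{*} = \sum_{h=1}^{L} W_h^{*} \bar y_h \qquad (4.2)$$
with new weights $W_h^{*}$. The new weights $W_h^{*}$ are chosen
such that the chi-square type distance, given by
$$\sum_{h=1}^{L} \frac{(W_h^{*} - W_h)^2}{W_h q_h} \qquad (4.3)$$
is minimum subject to the condition
$$\sum_{h=1}^{L} W_h^{*} \bar x_h = \bar X. \qquad (4.4)$$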


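Minimization of (4.3) subject to calibration equation (4.4)
leads to the combined regression type estimator given by

$$\bar y_{st}^{*} = \sum_{h=1}^{L} W_h \bar y_h + \frac{\sum_{h=1}^{L} W_h q_h \bar x_h \bar y_h}{\sum_{h=1}^{L} W_h q_h \bar x_h^2}\left[\bar X - \sum_{h=1}^{L} W_h \bar x_h\right] \qquad (4.5)$$

for the optimum choice of weights given by

$$W_h^{*} = W_h + \frac{W_h q_h \bar x_h}{\sum_{h=1}^{L} W_h q_h \bar x_h^2}\left[\bar X - \sum_{h=1}^{L} W_h \bar x_h\right]. \qquad (4.6)$$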
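
As a hedged illustration of the stratum-level calibration, the sketch below computes new weights that minimise the chi-square distance $\sum_h (W_h^{*} - W_h)^2/(W_h q_h)$ subject to $\sum_h W_h^{*}\bar x_h = \bar X$, namely $W_h^{*} = W_h + W_h q_h \bar x_h(\bar X - \sum_h W_h \bar x_h)/\sum_h W_h q_h \bar x_h^2$. The star notation, the function name `calibrated_stratum_weights` and all numbers are assumptions made for the example. With $q_h = 1/\bar x_h$ the calibrated estimator should coincide with the combined ratio estimator, which the last lines check.

```python
def calibrated_stratum_weights(W, xbar, X_bar, q):
    """Weights minimising sum (W*_h - W_h)^2 / (W_h q_h) s.t. sum W*_h xbar_h = X_bar."""
    x_st = sum(w * x for w, x in zip(W, xbar))                  # stratified mean of x
    denom = sum(w * qh * x * x for w, qh, x in zip(W, q, xbar))
    return [w + w * qh * x * (X_bar - x_st) / denom
            for w, qh, x in zip(W, q, xbar)]

# Hypothetical three-stratum example (illustrative numbers only)
W = [0.5, 0.3, 0.2]            # stratum weights N_h / N
xbar = [10.0, 20.0, 40.0]      # stratum sample means of x
ybar = [12.0, 25.0, 43.0]      # stratum sample means of y
X_bar = 21.0                   # known population mean of x

q = [1.0 / x for x in xbar]    # choice of q_h that gives the combined ratio estimator
W_star = calibrated_stratum_weights(W, xbar, X_bar, q)

x_st = sum(w * x for w, x in zip(W, xbar))
y_st = sum(w * y for w, y in zip(W, ybar))
y_calibrated = sum(w * y for w, y in zip(W_star, ybar))
y_combined_ratio = y_st * X_bar / x_st
calibrated_x = sum(w * x for w, x in zip(W_star, xbar))
print(W_star, y_calibrated, y_combined_ratio)
```

Other choices of `q` (for example all ones) give regression-type rather than ratio-type adjustments while still satisfying the calibration condition.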
If $q_h = 1/\bar x_h$ then estimator (4.5) reduces to the well known
combined ratio estimator in stratified sampling. The well
known estimator of variance of the combined regression
estimator is given by


3 a ag | ee 
Viv gle de ue (47) 


where 


is the A-th stratum sample variance and Oy = Nees 
b(x,,- x) and b = a Wi, Ind a nl Done iW, 4, have 
their usual meaning. The lower level calibration approach 
yields an estimator of variance of the combined regression 
estimator as 


‘Big D: Bier 

h h 
FAV Ie Pars (4.8) 
=] W,, 
where 
Wiel = 
ae wel H,) 

Ny 


and $W_h^{*}$ is given by (4.6). If $q_h = 1/\bar x_h$ then (4.8) reduces to
an estimator given by


pee 7)? Wel-s, 
Ae Ne E | rm Wi (1-4) i) 52 Ss 


X s Nn, 


(4.9) 


which is a special case of a class of estimators for estimating
the variance of the combined ratio estimator given by Wu
(1985) as


Hely-( 2)" Bea 


(4.10) 


for $g = 2$. The properties of variance estimators of the
combined ratio estimator are also studied by Saxena,
Nigam and Shukla (1995). In higher level calibration, a
new estimator is given by


(4.11) 


where $\Omega_h$ are suitably chosen weights such that the Chi-Square
distance function given by
(4.12)


is minimum subject to higher level calibration equation 
defined as 


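$$\sum_{h=1}^{L} \Omega_h s_{hx}^2 = V(\bar x_{st}) \qquad (4.13)$$
where
$$V(\bar x_{st}) = \sum_{h=1}^{L} \frac{W_h^2(1-f_h)}{n_h} S_{hx}^2$$
is assumed to be known and $s_{hx}^2 = (n_h-1)^{-1}\sum_{i=1}^{n_h}(x_{hi} - \bar x_h)^2$
is an unbiased estimator of $S_{hx}^2 = (N_h-1)^{-1}\sum_{i=1}^{N_h}(X_{hi} - \bar X_h)^2$.
This procedure leads to a new estimator for the variance of
the combined regression estimator given by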


* Bal cece) B,|V (&s)- Vz) | (4.14) 


W,*(1-f,) ome 3/ e W, Gray.) of 
haeroyy hx 


h=1 nN, nN), 


denotes the combined improved estimator of regression 
coefficient in stratified sampling and 


- - Ly 
pase Wie 
h=1 


ny, 


is an unbiased estimator of $V(\bar x_{st})$. If $q_h = 1/\bar x_h$ and
$Q_h = 1/s_{hx}^2$, then estimator (4.14) reduces to a new estimator
of variance of the combined ratio estimator given by


2 = re 
all) al EY as 


Mh OS V(X s,) 


which is a ratio type estimator proposed by Wu (1985) for
estimating the variance of the combined ratio estimator, but
which makes use of extra knowledge of the known variance of the
auxiliary variable at the estimation stage. Several more new
estimators can be constructed for new choices of the weights
$q_h$ and $Q_h$.


5. A WIDER CLASS OF ESTIMATORS 


If we define $u = X/\sum_{i=1}^{n} d_i x_i$ and $v = V_{YG}(\hat X_{HT})/\hat V_{YG}(\hat X_{HT})$,


then a wider class of estimators has been defined as


a ee ci a wien’ 


where $H(u,v)$ is a parametric function of $u$ and $v$ such
that $H(1,1) = 1$ and which satisfies certain regularity
conditions. Then all estimators obtained from the following
functions,

$$H(u,v) = u^{\alpha}v^{\beta}, \quad H(u,v) = \frac{1 + \alpha(u-1)}{1 + \beta(v-1)}, \quad H(u,v) = 1 + \alpha(u-1) + \beta(v-1)$$

and $H(u,v) = \{1 + \alpha(u-1) + \beta(v-1)\}^{-1}$ are special cases
of the higher level calibration approach, where $\alpha$ and $\beta$ are
unknown parameters involved in the function $H(u,v)$.
Replacing these parameters with their respective consistent
estimators in the class of estimators at (5.1) leads to the
same asymptotic variance, as shown by Srivastava and Jhajj
(1983), Singh and Singh (1984) and Mahajan and Singh
(1996). The extension of the present investigation to two phase
sampling following Hidiroglou and Sarndal (1995) is in
progress.

The next section has been devoted to studying the 
performance of the higher order calibration approach 
through simulation. 


6. SIMULATION STUDY 


Under the simulation study, we have considered comparisons
of estimators of variance of the ratio estimator as well as
that of the regression estimator. To avoid any kind of confusion,
we have redefined the estimators considered for
comparison as follows:


6.1 Ratio Estimator 


We have compared the estimators of the variance of the 
ratio estimator, given by 


ape Naf) a Ne 
VAY, 2 ey ea 6.1.1 
‘| onc) naan y ( ) 
with the estimator, given by 
byw posae Si 
frees) =i (aes, = (6.1.2) 
Sy 


6.2 Regression Estimator 


We have also compared the estimators of the variance of 
the regression estimator, given by 


VA WHI 
MAR sofbboya yas 2 eile Sees SORE 15 
nin Ma 
with the estimator, given by
> (¢ oa Bee a emi 
V, (Peas) al ee + W, (Sy - 5.) 


where ,,7=1,2,3 have the same meaning as defined 
earlier. 

We have considered two types of populations viz. finite 
populations as well as infinite populations to cover almost 
all practical situations. 


(6.2.2) 


6.3 Finite Populations 


In the case of finite populations, we have taken a population
consisting of $N = 20$ units from Horvitz and Thompson
(1952). The study variable, $y_i$, is the number of households
on the $i$-th block and the known auxiliary character, $x_i$, is the
eye-estimated number of households on the $i$-th block. All
possible samples of size $n = 5$ were selected by SRSWOR,
which results in


$$\binom{N}{n} = 15{,}504$$


samples. From the $k$-th sample, the estimator


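$$\hat Y_{ratio}|_k = \hat Y\left(\frac{X}{\hat X}\right) \quad \text{with} \quad \hat Y = \frac{N}{n}\sum_{i=1}^{n} y_i$$


was computed. Empirical mean squared error of this
estimator was computed as


$$\mathrm{MSE}(\hat Y_{ratio}) = \binom{N}{n}^{-1}\sum_{k}\left[\hat Y_{ratio}|_k - Y\right]^2. \qquad (6.3.1)$$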
For the $k$-th sample, the ratio type estimators of variance
$\hat V_h(\hat Y_{ratio})|_k$, $h = 1, 2$,
given by (6.1.1) and (6.1.2) respectively, for estimating the
variance of the ratio estimator were also obtained. The bias
in the $h$-th ratio type estimator of variance was computed as


$$B\{\hat V_h(\hat Y_{ratio})\} = \binom{N}{n}^{-1}\sum_{k}\hat V_h(\hat Y_{ratio})|_k - \mathrm{MSE}(\hat Y_{ratio}) \qquad (6.3.2)$$


and mean squared error was computed as


$$\mathrm{MSE}\{\hat V_h(\hat Y_{ratio})\} = \binom{N}{n}^{-1}\sum_{k}\left[\hat V_h(\hat Y_{ratio})|_k - \mathrm{MSE}(\hat Y_{ratio})\right]^2. \qquad (6.3.3)$$


The percent relative efficiency of the estimator
$\hat V_2(\hat Y_{ratio})$ with respect to $\hat V_1(\hat Y_{ratio})$ was calculated as


$$RE = \mathrm{MSE}\{\hat V_1(\hat Y_{ratio})\} \times 100 / \mathrm{MSE}\{\hat V_2(\hat Y_{ratio})\}. \qquad (6.3.4)$$
The coverage by 95% confidence intervals,
$\mathrm{CCI}[\hat V_h(\hat Y_{ratio})]$ for $h = 1, 2$, was calculated for the $h$-th ratio type estimator of
variance by counting the number of times the true
population total, $Y$, falls between the limits defined as


A 


i 


ratio lk * fy-n-1 (@) (6.3.5) 


V,,\Vranio) le 
These results were also obtained from all possible samples 
of size 6 and 7 and have been presented in Table 1. 
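
The complete-enumeration procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the eight $(y, x)$ pairs are made up (they are not the Horvitz and Thompson block data), the sample size is 3 so that the loop stays small, and the variance estimator `ratio_variance_v0` is the textbook form $N^2(1-f)s_e^2/n$ with residuals $e_i = y_i - rx_i$, used as a stand-in rather than the estimators (6.1.1) or (6.1.2). The sketch enumerates every SRSWOR sample, then computes the empirical MSE of the ratio estimator of the total and the bias and MSE of the variance estimator relative to it.

```python
from itertools import combinations
from math import comb

def ratio_total(sample, X_total, N):
    """Ratio estimator of the total: Y_hat * X / X_hat for one sample of (y, x) pairs."""
    n = len(sample)
    y_hat = N / n * sum(y for y, _ in sample)
    x_hat = N / n * sum(x for _, x in sample)
    return y_hat * X_total / x_hat

def ratio_variance_v0(sample, N):
    """Stand-in variance estimator N^2 (1 - f) s_e^2 / n (an assumption, not (6.1.1))."""
    n = len(sample)
    r = sum(y for y, _ in sample) / sum(x for _, x in sample)
    s2_e = sum((y - r * x) ** 2 for y, x in sample) / (n - 1)
    return N ** 2 * (1 - n / N) * s2_e / n

# Hypothetical population of (y, x) pairs
population = [(12, 10), (15, 14), (9, 8), (22, 20), (18, 19), (30, 27), (25, 26), (11, 12)]
N, n = len(population), 3
Y_total = sum(y for y, _ in population)
X_total = sum(x for _, x in population)

samples = list(combinations(population, n))          # every possible SRSWOR sample
estimates = [ratio_total(s, X_total, N) for s in samples]
mse_ratio = sum((e - Y_total) ** 2 for e in estimates) / len(samples)

variances = [ratio_variance_v0(s, N) for s in samples]
bias_v = sum(variances) / len(variances) - mse_ratio
mse_v = sum((v - mse_ratio) ** 2 for v in variances) / len(variances)
print(len(samples), mse_ratio, bias_v, mse_v)
print(comb(20, 5))   # number of samples of size 5 from 20 units
```

The percent relative efficiency of one variance estimator with respect to another is then the ratio of the two `mse_v` values multiplied by 100.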

The same process was repeated for the regression
estimator


$$\hat Y_{GREG}|_k = \hat Y + \left[s_{xy}/s_x^2\right](X - \hat X)$$


of total obtained from (1.6) under a SRSWOR design. The
biases, relative efficiency and CCI were obtained by using
the $h$-th estimator of variance of the regression estimator,
$\hat V_h(\hat Y_{GREG})|_k$, for $h = 1, 2$, given by (6.2.1) and (6.2.2),
respectively. The results obtained have been presented in
Table 2. In addition, it was observed that for
$n = 5$, 0.020% of the estimates of variance obtained from the
estimator $\hat V_1(\hat Y_{GREG})$ and 0.022% of the estimates obtained from
the estimator $\hat V_2(\hat Y_{GREG})$ were negative. Similar results
were observed for more natural populations given by
Cochran (1963) and Sukhatme and Sukhatme (1970). Overall,
second order calibration estimators perform better than
first order calibration in the case of the finite populations.

In real life situations, the study variable and auxiliary 
variables may follow certain kinds of distributions like 
normal, beta or gamma etc. In order to see the performance 
of the proposed strategies under such circumstances, we 
generated artificial populations and considered the problem 
of estimation of finite population mean through simulation 
as follows. 




Table 1
Comparison of $\hat V_1(\hat Y_{ratio})$ with $\hat V_2(\hat Y_{ratio})$ for finite populations

n    B[V_1]         B[V_2]     RE        CCI[V_1]   CCI[V_2]
5    [illegible]    217.01     166.57    0.93       0.95
6    -141.92        102.00     115.06    0.91       0.92
7    -99.34         58.60      109.23    0.90       0.90

Table 2
Comparison of $\hat V_1(\hat Y_{GREG})$ and $\hat V_2(\hat Y_{GREG})$ for finite populations

n    B[V_1]         B[V_2]     RE        CCI[V_1]   CCI[V_2]
5    -328.49        -194.78    112.04    0.92       0.96
6    [illegible]    -136.34    103.02    0.90       0.93
7    -157.88        -94.38     101.21    0.91       0.94

6.4 Infinite Populations


The size $N$ of these populations is unknown. We generated
$n$ independent pairs of random numbers $y_i^{*}$ and $x_i^{*}$
(say), $i = 1, 2, \ldots, n$, from a subroutine VNORM with
PHI = 0.7, seed($y$) = 8987878 and seed($x$) = 2348789,
following Bratley, Fox and Schrage (1983). For fixed
$S_y = 50$ and $S_x = 50$, we generated transformed variables,


$$y_i = 3.0 + \sqrt{S_y^2(1-\rho^2)}\; y_i^{*} + \rho S_y x_i^{*} \qquad (6.4.1)$$
and
$$x_i = 4.0 + S_x x_i^{*} \qquad (6.4.2)$$


for different values of the correlation coefficient $\rho$. For the
$k$-th sample, the estimator


$$\bar y_{ratio}|_k = \bar y\left(\frac{\bar X}{\bar x}\right), \quad \text{with} \quad \bar y = \frac{1}{n}\sum_{i=1}^{n} y_i \quad \text{and} \quad \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i,$$


was computed. Empirical mean squared error of this
estimator was computed as


$$\mathrm{MSE}(\bar y_{ratio}) = \frac{1}{15{,}000}\sum_{k=1}^{15{,}000}\left[\bar y_{ratio}|_k - \bar Y\right]^2. \qquad (6.4.3)$$


For the $k$-th sample, the ratio type estimators of variance
$\hat V_h(\bar y_{ratio})|_k$, $h = 1, 2$,
obtained from (6.1.1) and (6.1.2) respectively, for estimating
the variance of the ratio estimator of the population mean
were also derived. The bias in the $h$-th ratio type estimator
of variance was computed as


$$B\{\hat V_h(\bar y_{ratio})\} = \frac{1}{15{,}000}\sum_{k=1}^{15{,}000}\hat V_h(\bar y_{ratio})|_k - \mathrm{MSE}(\bar y_{ratio}) \qquad (6.4.4)$$


and mean squared error was computed as


$$\mathrm{MSE}\{\hat V_h(\bar y_{ratio})\} = \frac{1}{15{,}000}\sum_{k=1}^{15{,}000}\left[\hat V_h(\bar y_{ratio})|_k - \mathrm{MSE}(\bar y_{ratio})\right]^2. \qquad (6.4.5)$$


The percent relative efficiency of the estimator
$\hat V_2(\bar y_{ratio})$ with respect to $\hat V_1(\bar y_{ratio})$ was calculated as


$$RE = \mathrm{MSE}\{\hat V_1(\bar y_{ratio})\} \times 100 / \mathrm{MSE}\{\hat V_2(\bar y_{ratio})\}. \qquad (6.4.6)$$
The coverage by 95% confidence intervals,
$\mathrm{CCI}[\hat V_h(\bar y_{ratio})]$ for $h = 1, 2$,
was calculated for the $h$-th ratio type estimator of variance by
counting the number of times the true population mean, $\bar Y$,
falls between the limits defined as


$$\bar y_{ratio}|_k \mp 1.96\sqrt{\hat V_h(\bar y_{ratio})|_k}. \qquad (6.4.7)$$


These results were obtained for samples of size $n = 60$,
80 and 100 for different values of the correlation coefficient, as
presented in Table 3.

The same process was repeated for the regression estimator
$$\bar y_{GREG}|_k = \bar y + \hat\beta(\bar X - \bar x)$$


of mean obtained from (1.6) under a SRSWR design. The
biases, relative efficiency and CCI were obtained by using
the $h$-th estimator of variance of the regression estimator,


$\hat V_h(\bar y_{GREG})|_k$, for $h = 1, 2$,


derived from (6.2.1) and (6.2.2), respectively. The results 
obtained have been presented in Table 4. We acknowledge 
that it is worth while studying the proposed strategy through 
simulation in more detail and its application in actual 
practice. The empirical study was carried out in 
FORTRAN-77 using a PENTIUM-120. 
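
The data-generating step of Section 6.4 can be sketched as follows. This is a hedged illustration only: Python's `random.gauss` stands in for the VNORM subroutine, the values 50 are passed as standard deviations (one reading of the text), and the seed and sample size are arbitrary choices. The sketch draws independent standard normal pairs, applies the transformations (6.4.1) and (6.4.2), and checks that the generated pairs have correlation close to the target $\rho$.

```python
import math
import random

def generate_pairs(n, rho, s_y, s_x, rng):
    """Generate (y, x) pairs following the transformations (6.4.1) and (6.4.2)."""
    pairs = []
    for _ in range(n):
        y_star = rng.gauss(0.0, 1.0)      # stand-in for VNORM output (assumption)
        x_star = rng.gauss(0.0, 1.0)
        y = 3.0 + math.sqrt(s_y ** 2 * (1 - rho ** 2)) * y_star + rho * s_y * x_star
        x = 4.0 + s_x * x_star
        pairs.append((y, x))
    return pairs

def correlation(pairs):
    """Sample correlation coefficient between y and x."""
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    s_yx = sum((y - mean_y) * (x - mean_x) for y, x in pairs)
    s_yy = sum((y - mean_y) ** 2 for y, _ in pairs)
    s_xx = sum((x - mean_x) ** 2 for _, x in pairs)
    return s_yx / math.sqrt(s_yy * s_xx)

rng = random.Random(2348789)              # arbitrary seed for reproducibility
pairs = generate_pairs(20000, 0.7, 50.0, 50.0, rng)
r = correlation(pairs)
print(round(r, 3))
```

Repeating the generation 15,000 times for each sample size and value of $\rho$, and applying the bias, MSE and coverage formulas to each replicate, would follow the same pattern as the simulation described in this section.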
7. CONCLUSION


The higher level calibration approach can be used if the variance
of the auxiliary character is known in addition to the known
total of that character. The statistical package GES
developed by Statistics Canada can be modified to obtain
better estimators of the variance of GREG, useful for
surveys where information on the variance of auxiliary characters
is available or can be calculated.
Table 3
Comparison of $\hat V_1(\bar y_{ratio})$ with $\hat V_2(\bar y_{ratio})$ for infinite populations

n     rho   B[V_1]         B[V_2]         RE       CCI[V_1]   CCI[V_2]
60    0.1   13.02          10.33          188.7    0.96       0.95
      0.3   8.07           6.35           192.6    0.97       0.95
      0.5   4.33           [illegible]    195.9    0.96       0.96
      0.7   [illegible]    1.37           197.9    0.97       0.97
      0.9   0.33           0.26           197.7    0.99       0.98
80    0.1   [illegible]    2.91           123.2    0.94       0.93
      0.3   2.06           1.84           123.0    0.94       0.94
      0.5   1.13           1.01           122.7    0.95       0.95
      0.7   0.47           0.42           122.0    0.97       0.96
      0.9   0.08           0.08           119.1    0.98       0.97
100   0.1   0.76           0.77           106.1    0.94       0.93
      0.3   0.49           0.49           105.8    0.94       0.94
      0.5   0.27           0.27           105.3    0.95       0.95
      0.7   0.12           0.12           104.4    0.96       0.95
      0.9   0.02           0.02           102.2    0.97       0.95

Table 4
Comparison of $\hat V_1(\bar y_{GREG})$ with $\hat V_2(\bar y_{GREG})$ for infinite populations

n     rho   B[V_1]         B[V_2]         RE             CCI[V_1]   CCI[V_2]
60    0.1   10.12          8.42           177.6          0.98       0.95
      0.3   5.06           4.33           161.5          0.97       0.95
      0.5   [illegible]    2.36           [illegible]    0.95       0.96
      0.7   0.72           0.38           151.9          0.97       0.95
      0.9   0.13           0.10           147.7          0.99       0.97
80    0.1   1.23           [illegible]    153.9          0.96       0.95
      0.3   1.03           1.01           143.5          0.98       0.94
      0.5   0.13           0.11           132.8          0.97       0.95
      0.7   0.07           0.06           121.6          0.97       0.95
      0.9   0.02           0.03           [illegible]    0.96       0.96
100   0.1   0.65           0.57           136.1          0.95       0.94
      0.3   0.39           0.33           [illegible]    0.94       0.94
      0.5   0.13           0.13           129.6          0.95       0.95
      0.7   0.02           0.02           114.4          0.96       0.95
      0.9   0.01           0.01           [illegible]    0.97       0.96
APPENDIX


This appendix has been devoted to deriving the optimum
value of $\Omega_{ij}$ as given in (3.4). The Lagrange function is
given by


TLS, : Q,- Di) _ 
2 i=l] j=l D, Q, 
1 n n ss 
2d do, (d,x,-4x,P-Vyo (Rr) |. AD 


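On differentiating (A.1) with respect to $\Omega_{ij}$ and equating to
zero, we get


$$\Omega_{ij} = D_{ij} + \lambda D_{ij}Q_{ij}(d_i x_i - d_j x_j)^2. \qquad (A.2)$$
On putting (A.2) in (3.2), we get
$$\lambda = \frac{V_{YG}(\hat X_{HT}) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} D_{ij}(d_i x_i - d_j x_j)^2}{\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} D_{ij}Q_{ij}(d_i x_i - d_j x_j)^4}. \qquad (A.3)$$


On substituting (A.3) in (A.2), we get the optimum value
of $\Omega_{ij}$ as given in (3.4).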


REFERENCES 


BRATLEY, P., FOX, B.L., and SCHRAGE, L.E. (1983). A Guide to 
Simulation. New York: Springer-Verlag. 


COCHRAN, W.G. (1963). Sampling Techniques, (second edition). 
New York: John Wiley and Sons. 

DAS, A.K., and TRIPATHI, T.P. (1978). Use of auxiliary 
information in estimating the finite population variance. Sankhya, 
40(C), 139-148. 

DENG, L.Y., and WU, C.F.J. (1987). Estimation of variance of the 
regression estimator. Journal of the American Statistical 
Association, 82, 568-576. 

DEVILLE, J.-C., and SARNDAL, C.-E. (1992). Calibration 
estimators in survey sampling. Journal of the American Statistical 
Association, 87, 376-382. 

FULLER, W.A. (1970). Sampling with random stratum boundaries. 
Journal of the Royal Statistical Society, 32, 209 - 226. 

GARCIA, M.R., and CEBRIAN, A.A. (1996). Repeated substitution 
method: The ratio estimator for the population variance. Metrika, 
43, 101-105. 

HIDIROGLOU, M.A., and SARNDAL, C.-E. (1995). Use of
auxiliary information for two-phase sampling. Proceedings of the
Section on Survey Research Methods, American Statistical
Association, Volume II, 873-878.


HORVITZ, D.G., and THOMPSON, D.J. (1952). A generalisation of 
sampling without replacement from a finite universe. Journal of 
the American Statistical Association, 47, 663-685. 


ISAKI, C.T. (1983). Variance estimation using auxiliary information. 
Journal of the American Statistical Association, 78(381), 117-123. 

MAHAJAN, P.K., and SINGH, S. (1996). On estimation of total in 
two stage sampling. Journal of Statistical Research, 30, 127-131. 

SARNDAL, C.-E. (1996). Efficient estimators with simple variance 
in unequal probability sampling. Journal of the American 
Statistical Association, 91, 1289-1300. 

SARNDAL, C.-E., SWENSSON, B., and WRETMAN, J.H. (1989). 
The weighted residual technique for estimating the variance of the 
general regression estimator of the finite population total. 
Biometrika, 76(3), 527-537. 

SARNDAL, C.-E., SWENSSON, B., and WRETMAN, J.H. (1992). 
Model Assisted Survey Sampling. New York: Springer-Verlag. 

SAXENA, S.K., NIGAM, A.K., and SHUKLA, N.D. (1995).
Variance estimation for combined ratio estimator. Sankhya,
57(B), 85-92.

SHAH, D.N., and PATEL, P.A. (1996). Asymptotic properties of a 
generalized regression-type predictor of a finite population 
variance in probability sampling. The Canadian Journal of 
Statistics, 24(3), 373-384. 

SINGH, P., and SRIVASTAVA, S.K. (1980). Sampling scheme 
providing unbiased regression estimators. Biometrika, 67, 205-209. 

SINGH, R.K., and SINGH, G. (1984) A class of estimators with 
estimated optimum values in sample surveys. Statistics & 
Probability Letters, 2, 319-321. 


SINGH, S., and SINGH, S. (1988). Improved estimators of K and B 
in finite populations. Journal of the Indian Society of Agricultural 
Statistics, 121-126. 

SINGH, S., MANGAT, N.S., and MAHAJAN, P.K. (1995). General 
class of estimators. Journal of the Indian Society of Agricultural 
Statistics, 47(2), 129-133. 

SRIVASTAVA, S.K. (1971). A generalized estimator for the mean of 
finite population using multi-auxiliary information. Journal of the 
American Statistical Association, 66, 404-407. 

SRIVASTAVA, S.K., and JHAJJ, H.S. (1980). A class of estimators
using auxiliary information for estimating finite population
variance. Sankhya, 42(C), 87-96.

SRIVASTAVA, S.K., and JHAJJ, H.S. (1981). A class of estimators 
of the population mean in survey sampling using auxiliary 
information. Biometrika, 68, 341-343. 

SRIVASTAVA, S.K., and JHAJJ, H.S. (1983). A class of estimators
of the population mean using multi-auxiliary
information. Calcutta Statistical Association Bulletin, 32, 47-56.

SUKHATME, P.V., and SUKHATME, B.V. (1970). Sampling Theory
of Surveys With Applications. Iowa: Iowa State University Press.

SWAIN, A.K.P.C., and MISHRA, G. (1992). Unbiased estimators of
finite population variance using auxiliary information. Metron,
201-215.


WU, C.F.J. (1982). Estimation of variance of the ratio estimator. 
Biometrika, 69, 183-189. 


WU, C.F.J. (1985). Variance estimation for combined ratio and 
combined regression estimators. Journal of the Royal Statistical 
Society, 47(B), 147-154. 


YATES, F., and GRUNDY, P.M. (1953). Selection without 
replacement from within strata with probability proportional to 
size. Journal of the Royal Statistical Society, 15(B), 253-261. 


Survey Methodology, June 1998 
Vol. 24, No. 1, pp. 51-55 
Statistics Canada 




Logistic Generalized Regression Estimators 


RISTO LEHTONEN and ARI VEIJANEN' 


ABSTRACT 


In this paper we study the model-assisted estimation of class frequencies of a discrete response variable by a new survey 
estimation method, which is closely related to generalized regression estimation. In generalized regression estimation the 
available auxiliary data are incorporated in the estimation procedure by a linear model fit. Instead of using a linear model 
for the class indicators, we describe the joint distribution of the class indicators by a multinomial logistic model. Logistic 
generalized regression estimators are introduced for class frequencies in a population and domains. Monte Carlo 
experiments were carried out for simulated data and for real data taken from the Labour Force Survey conducted monthly 
by Statistics Finland. The logistic generalized regression estimation yielded better results than the ordinary regression 
estimation for small domains and particularly for small class frequencies. 


KEY WORDS: Auxiliary information; Class frequencies; Generalized linear models; Labour force survey; Model-assisted
estimation; Regression estimators.


1. INTRODUCTION 


Consider the estimation of class frequencies of a discrete 
response variable in a sample survey. The number of 
individuals in a class equals the class indicator’s sum over 
the population, the total of the indicator. Therefore, the 
problem can be solved by methods designed for the 
estimation of population totals. To improve the accuracy of 
the estimation, a survey statistician often makes use of the 
available auxiliary data. If the expectation of the response
variable can be assumed to depend linearly on the auxiliary
variables, as can be the case for continuous response variables,
it is advisable to use the generalized regression
estimator (Sarndal, Swensson and Wretman 1992; Estevao,
Hidiroglou and Sarndal 1995). Generalized regression
estimation can improve the efficiency and reduce the bias 
due to unit nonresponse if the auxiliary variables correlate 
strongly with the response variable. 

From a modeler’s perspective, a linear model is quite 
restrictive and might not be the best choice for binary 
response variables, such as employment status of a person 
(employed, unemployed), or more generally for discrete 
response variables, such as a person’s status in the labour 
market (employed, unemployed, not in labour force). For 
such variables we introduce a class of logistic generalized 
regression estimators based on a multinomial logistic model 
describing the joint distribution of the class indicators. The 
motivation for the selection of this specific model type thus 
is similar to that used in the context of generalized linear 
models (McCullagh and Nelder 1989). 

The parameters of the logistic model are here estimated
by maximizing a sample-based weighted loglikelihood, the
Horvitz-Thompson estimator of the population loglikelihood
function (Godambe and Thompson 1986; Nordberg
1989; Skinner, Holt and Smith 1989; Sarndal et al. 1992,
p. 517).

As an application, we consider the estimation of the 
unemployment rate in the Labour Force Survey conducted 
monthly by Statistics Finland. Administrative records 
indicating whether a person is registered jobseeker in local 
employment office are available as register-based auxiliary 
data, and these records were merged with the survey data on 
individual basis using personal identification numbers which 
are unique in both data sources. The corresponding auxiliary 
variable correlates strongly with the survey measurement on 
person’s unemployment. Thus, improvement in efficiency 
and reduction of bias can be expected by making use of these 
administrative data in the estimation procedure. Additional 
auxiliary data (sex, age, regional data) were gathered from the 
Population Register. Also these auxiliary data were merged 
with the survey data on individual basis. 

The properties of the generalized regression estimators
were studied by Monte Carlo simulation methods where
SRSWOR samples were repeatedly drawn from a population
constructed from the Labour Force Survey data. We use
incomplete poststratification or raking based on a main 
effects ANOVA model. The experiments indicate that the 
logistic formulation yields better results than the linear 
formulation for small domains. We obtained good results 
also when there was only one continuous auxiliary variable. 

This paper is organized as follows. Section 2 defines the 
multinomial logistic model and basic concepts used. In 
Section 3 we introduce generalized regression estimators of 
class frequencies in a population and domains, and discuss 
the estimation of the model parameters by weighted 
loglikelihood. Variance estimators are presented. Monte 
Carlo experiments are discussed in Section 4. Conclusions 
are drawn in Section 5. 


' Risto Lehtonen and Ari Veijanen, Statistics Finland, P.O. Box 5A, FIN-00022 Statistics Finland, Finland. 
2. MODEL


Consider discrete $m$-valued random variables $Y_k$
associated with $N$ elements $k$ in a finite population $U$. We
observe their realized values $y_k$ only in a sample $s \subset U$ of
size $n$. Our goal is to estimate the frequency distribution of
the $y_k$'s in the population; in classification problems, we
estimate the class proportions. Suppose we know the vector
of auxiliary variables $x_k$ for every element in the
population. We impose a multinomial logistic model


$$P\{Y_k = i \mid x_k; \beta\} = \frac{\exp\{x_k'\beta_i\}}{\sum_{r=1}^{m}\exp\{x_k'\beta_r\}} \quad (i = 1, 2, \ldots, m) \qquad (1)$$


and assume that the $Y_k$'s are conditionally independent
given the $x_k$'s. In the binary case, this is the model used in
logistic regression. The parameter vector $\beta$ is composed of
vectors $\beta_i$ $(i = 1, 2, \ldots, m)$ with components $\beta_{ij}$ $(j = 1,
2, \ldots, q)$. The parameters are assumed identifiable, that is,
no two parameter values yield identical probabilities (1) for
every $k$. This implies that the auxiliary variables
$x_{kj}$ $(j = 1, 2, \ldots, q)$ are linearly independent. To avoid
identifiability problems, we set $\beta_1 = 0$. It is straightforward
to generalize (1) so that different auxiliary variables can be
assigned for the $m$ classes (Lehtonen and Veijanen 1998).
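
A small sketch may help to make the model concrete. It evaluates the class probabilities of the multinomial logistic model for one element, with the coefficient vector of the first class fixed at zero for identifiability. The function name `class_probabilities` and the coefficient values are hypothetical, chosen only for illustration; subtracting the largest score before exponentiating is a standard numerical safeguard and does not change the probabilities.

```python
import math

def class_probabilities(x, betas):
    """Multinomial logistic probabilities P{Y = i | x} for i = 1..m.

    betas is a list of m coefficient vectors; betas[0] is the zero vector
    (the identifiability constraint beta_1 = 0).
    """
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    top = max(scores)                            # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical element with an intercept and one auxiliary variable, m = 3 classes
x_k = [1.0, 0.5]
betas = [[0.0, 0.0], [0.2, -1.0], [-0.4, 0.8]]   # illustrative values only
probs = class_probabilities(x_k, betas)
print(probs, sum(probs))
```

Because the first coefficient vector is zero, the ratio of any class probability to the first one equals the exponential of that class's linear score.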

The sampling design specifies the inclusion probabilities
of population elements. The $k$-th element is drawn with
inclusion probability $\pi_k$, and elements $k$ and $p$ are simultaneously
in the sample $s$ with probability $\pi_{kp} > 0$ $(\pi_{kk} = \pi_k)$.
As usual, the sample membership indicators $I_k = I\{k \in s\}$ are
assumed conditionally independent of the $Y_k$'s given the
$x_k$'s, but the inclusion probabilities may correlate with the
auxiliary variables.

Under unit nonresponse, if element $k$ responds with
probability $\theta_k$ independently of the $I_p$'s and $Y_p$'s $(p \in U)$,
then we substitute $\pi_k\theta_k$ for $\pi_k$. Correspondingly, $\pi_{kp}$ is
replaced by $\pi_{kp}\theta_k\theta_p$ when the elements respond independently
of each other.


3. LOGISTIC GENERALIZED REGRESSION 
ESTIMATION 


3.1 Definition of LGREG 


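To estimate the frequency distribution of the $y_k$'s, we
define class indicators $Z_{ki} = I\{Y_k = i\}$ with realizations
$z_{ki}$ and estimate the totals $t_i = \sum_{k \in U} z_{ki}$. The Horvitz-
Thompson (HT) estimator of $t_i$ is $\hat t_i^{HT} = \sum_{k \in s} a_k z_{ki}$, where
the sampling weights are $a_k = 1/\pi_k$. Generalized regression
estimation (GREG) is assisted by a regression model
$Z_{ki} = x_k'B_i + \varepsilon_{ki}$ with $\mathrm{Var}(\varepsilon_{ki}) = \sigma_{ki}^2$ (Sarndal et al. 1992;
Estevao et al. 1995). The parameter $B_i$ is estimated by


$$\hat B_i = \left(\sum_{k \in s}\frac{a_k x_k x_k'}{\sigma_{ki}^2}\right)^{-1}\sum_{k \in s}\frac{a_k x_k z_{ki}}{\sigma_{ki}^2} \quad (i = 1, 2, \ldots, m) \qquad (2)$$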


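and the fitted values $\hat z_{ki} = x_k'\hat B_i$ are incorporated in the
GREG estimator


$$\hat t_i^{GREG} = \sum_{k \in U}\hat z_{ki} + \sum_{k \in s} a_k(z_{ki} - \hat z_{ki}) \quad (i = 1, 2, \ldots, m). \qquad (3)$$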


The selection of a linear model for a GREG estimator (3)
is fully justified for a continuous response variable. For
binary measurements $Z_{ki}$, a linear model might be unrealistic.
Ordinarily, we would prefer a logistic model to a
linear one. In the logistic formulation, the predicted value
always lies in [0,1], whereas in the linear formulation, the
predicted value can exceed these natural limits. If the
probability of $Z_{ki} = 1$ is close to 0 or 1, then the two models
yield different results. Moreover, when there are $m > 2$
classes, it appears sensible to describe the joint distribution
of the $Z_{ki}$'s $(i = 1, 2, \ldots, m)$ by the multinomial logistic
model (1). To apply the model (1) in generalized regression
estimation, we estimate the expectations $\mu_{ki} = E(Z_{ki} \mid x_k; \beta)
= P\{Y_k = i \mid x_k; \beta\}$ by

$$\hat\mu_{ki} = P\{Y_k = i \mid x_k; \hat\beta\} = \frac{\exp\{x_k'\hat\beta_i\}}{1 + \sum_{r=2}^{m}\exp\{x_k'\hat\beta_r\}},$$

which depend nonlinearly on the auxiliary variables. We
define a logistic generalized regression (LGREG) estimator
by


$$\hat t_i^{LGREG} = \sum_{k \in U}\hat\mu_{ki} + \sum_{k \in s} a_k(z_{ki} - \hat\mu_{ki}) \quad (i = 1, 2, \ldots, m). \qquad (4)$$


The GREG and LGREG estimators (3) and (4) include
a sum of predicted values over the population. However, it
is not actually necessary to have information about the $x_k$'s
for every element in the population $U$. In GREG (3), it is
enough to know the auxiliary totals $\sum_{k \in U} x_k$, because (3)
can also be expressed in the form $\hat t_i^{GREG} = \hat t_i^{HT} +
(\sum_{k \in U} x_k - \sum_{k \in s} a_k x_k)'\hat B_i$. For the special case of complete
poststratification, the information required in LGREG is
similar to that needed in GREG. For other cases, such as
incomplete poststratification, we cannot compute $\sum_{k \in U}\hat\mu_{ki}$
in (4) without knowing the frequency of each value of $x_k$
in the population. For example, if we have two discrete
auxiliary variables, then in GREG we need the marginal
frequencies, but in LGREG we need the cell frequencies.
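
The structure of the LGREG estimator (a sum of fitted probabilities over the population plus a weighted sum of sample residuals) can be sketched as follows. The function name `lgreg_totals`, the fitted probabilities, the sample and the weights are all hypothetical and chosen only for illustration; in practice the probabilities would come from the fitted multinomial logistic model. Since both the fitted probabilities and the class indicators sum to one over the classes, the estimated class totals add up to the population size, which the example checks.

```python
def lgreg_totals(mu, sample_idx, weights, classes, m):
    """LGREG class totals: sum over U of mu_ki plus sum over s of a_k (z_ki - mu_ki).

    mu         : list of N rows, each holding m fitted probabilities (population level)
    sample_idx : indices of sampled elements
    weights    : sampling weights a_k = 1 / pi_k for the sampled elements
    classes    : dict mapping a sampled index to its observed class (0-based)
    """
    totals = []
    for i in range(m):
        synthetic = sum(row[i] for row in mu)                   # sum of predictions over U
        adjustment = sum(a * ((1.0 if classes[k] == i else 0.0) - mu[k][i])
                         for k, a in zip(sample_idx, weights))  # weighted sample residuals
        totals.append(synthetic + adjustment)
    return totals

# Hypothetical population of N = 6 elements and m = 3 classes
mu = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
    [0.3, 0.3, 0.4],
]
sample_idx = [0, 2, 5]
weights = [2.0, 2.0, 2.0]            # SRSWOR with n = 3 from N = 6, so a_k = N / n
classes = {0: 0, 2: 1, 5: 2}         # observed classes of the sampled elements

totals = lgreg_totals(mu, sample_idx, weights, classes, m=3)
print(totals, sum(totals))
```

Restricting both sums to the elements of one domain gives the corresponding domain estimator in the same way.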

In addition to estimates for the whole population,
estimates are usually calculated for subpopulations. The
population $U$ is partitioned into domains $U_{(d)} \subset U$ of size
$N_{(d)}$. The set $s$ of respondents is composed of corresponding
subsets $s_{(d)} = s \cap U_{(d)}$ with $n_{(d)}$ elements. As in
GREG estimation (Sarndal et al. 1992), we apply the LGREG
estimator


$$\hat t_{(d)i} = \sum_{k \in U_{(d)}}\hat\mu_{ki} + \sum_{k \in s_{(d)}} a_k(z_{ki} - \hat\mu_{ki}). \qquad (5)$$


These estimators are additive: $\sum_{i=1}^{m}\hat t_{(d)i} = N_{(d)}$. If we
combine two nonoverlapping domains $d_1$ and $d_2$, the
LGREG estimate for $d = d_1 \cup d_2$ is $\hat t_{(d)i} = \hat t_{(d_1)i} + \hat t_{(d_2)i}$.
Hence, $\sum_{d}\hat t_{(d)i} = \hat t_i$ for nonoverlapping domains
whose union is $U$.
In generalized regression estimation, an estimate (3) or
(4) can be negative when negative residuals coincide with
large values of $a_k$. Negative GREG estimates become more
common as the number of auxiliary variables increases
(Chambers 1996). In LGREG estimation, in contrast, this is
not so, because $\hat\mu_{ki}$ is bounded by the model formulation. In
our experiments, LGREG estimates were negative only for
small domains in certain cases. In many cases, the LGREG
estimate equals the sum of estimated expectations and then
it is always positive (see Section 3.2).

If the model (1) includes an auxiliary indicator variable, 
its total over the population is exactly estimated by 
LGREG. This calibration property is desirable in many 
applications. 


3.2 ML Estimation by $\pi$-Weighted Loglikelihood


We estimate the parameter $\beta$ in the model (1) by
maximizing a $\pi$-weighted loglikelihood


$$l(\beta) = \sum_{k \in s} a_k\left[I\{Y_k = 1\}\log\left(1 - \sum_{i=2}^{m}\mu_{ki}\right) + \sum_{i=2}^{m} I\{Y_k = i\}\log\mu_{ki}\right]$$


(Godambe and Thompson 1986; Nordberg 1989; Sarndal 
et al. 1992, p. 517). In general, we maximize the likelihood 
function numerically by appropriate numerical methods 
such as a Newton-Raphson algorithm. 

It can be shown that for complete poststratification, the
fitted values $\hat z_{ki}$ in GREG are equal to the estimates $\hat\mu_{ki}$ in
LGREG. Thus, when there are no missing cells in complete
poststratification, the GREG and LGREG estimators are
identical (Lehtonen and Veijanen 1998). This does not
hold for other models such as incomplete poststratification.

The LGREG estimator (4) has two parts: a sum of estimated
expectations over the population and an adjustment
term $\sum_{k \in s} a_k (z_{ki} - \hat\mu_{ki})$. It can be shown that if the model
contains an intercept, the adjustment term vanishes and the
frequency $t_i$ is estimated by $\sum_{k \in U} \hat\mu_{ki}$ (Lehtonen and
Veijanen 1998).

In our experiments, we apply a ratio estimator
$\hat R = \hat t_2/(\hat t_1 + \hat t_2)$. Its variance is estimated by Taylor
linearization techniques (Särndal et al. 1992, p. 179):

$$\hat V(\hat R) = \frac{\hat R^2 \hat C_{11} + 2\hat R(\hat R - 1)\hat C_{12} + (\hat R - 1)^2 \hat C_{22}}{(\hat t_1 + \hat t_2)^2}, \tag{6}$$

where $C_{ij}$, the covariance of $\hat t_i$ and $\hat t_j$, is estimated by

$$\hat C_{ij} = \sum_{k \in s}\sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \frac{e_{ki}}{\pi_k} \frac{e_{lj}}{\pi_l}. \tag{7}$$

In (7), $e_{ki} = z_{ki} - \hat\mu_{ki}$ and $\Delta_{kl} = \mathrm{Cov}(I_k, I_l) = \pi_{kl} - \pi_k \pi_l$.

Similar derivations hold for the corresponding domain
estimators.


4. EXPERIMENTS 


4.1 Details of Simulation Studies 


In all the simulation experiments, K = 1,000 samples 
were drawn from a population with simple random 
sampling without replacement (SRSWOR). Monte Carlo 
means and standard errors of the estimates were calculated 
from the simulated samples. The design effect for an
estimator $\hat t_{(d)i}$ was calculated as a ratio of estimated
variances: $\mathrm{Deff}(\hat t_{(d)i}) = \hat V(\hat t_{(d)i}) / \hat V(\hat t_{(d)i}^{HT})$, where
$\hat V(\hat t_{(d)i}^{HT})$ denotes the Monte Carlo variance estimate of
the HT estimator (Lehtonen and Pahkinen 1996). We
measured the overall accuracy of domain estimates by the
mean absolute relative domain error over D domains and K
samples $s_j$:

$$\mathrm{MARDE} = \frac{1}{D}\sum_{d=1}^{D}\frac{1}{K}\sum_{j=1}^{K} 100\,\frac{|\hat t_{(d)i}(s_j) - t_{(d)i}|}{t_{(d)i}}.$$


In the GREG estimates (2), the variance was a constant
$\sigma_k^2 = \sigma^2$, which cancelled out. For LGREG, domain
frequencies were estimated by (5) and variances by (7). For
GREG and HT, see Särndal et al. (1992, p. 401).
Confidence intervals for the frequencies were computed as
if the class indicators were independent. At the nominal
confidence level of 95%, an acceptable coverage rate lies
in [93.65%, 96.35%] for K = 1,000 simulated samples.
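
The quoted band is consistent with a normal approximation to the binomial distribution of the number of covering intervals among K independent samples. The sketch below reproduces it under that assumption; the function name `acceptable_coverage_band` is hypothetical and the derivation is an inference from the numbers, not a method stated in the text.

```python
import math
from statistics import NormalDist


def acceptable_coverage_band(nominal=0.95, K=1000, level=0.95):
    """Band (in %) in which an empirical coverage rate should fall if the
    true coverage equals `nominal`, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(nominal * (1.0 - nominal) / K)
    return 100.0 * (nominal - half_width), 100.0 * (nominal + half_width)


low, high = acceptable_coverage_band()
print(round(low, 2), round(high, 2))  # 93.65 96.35
```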


4.2 An Experiment With Simulated Data 


To compare LGREG with GREG, we simulated a data
set in which the auxiliary variable X was a continuous
random variable uniformly distributed in (-3, 3). The
variable of interest, Y, representing three classes, followed
distribution (1) specified by $x_k'B_1 = 0$, $x_k'B_2 = 3X_k - 1$, and
$x_k'B_3 = -2X_k$ for N = 10,000 elements (k = 1, 2, ..., N). A
thousand samples of size n = 1,000 were independently
drawn with SRSWOR. $X_k$ and $X_k^2$ were used as auxiliary
variables. All the estimators appeared unbiased (Table 1).
The variance estimates had empirical bias smaller than 3%
and standard deviation smaller than 5%.


Table 1
The design effects (Deff) for class frequency estimators and the
empirical coverage rates (CR) (%) of 95% confidence intervals for
classes i = 1, 2, 3

                    Deff                          CR
Method    $\hat t_1$   $\hat t_2$   $\hat t_3$      $\hat t_1$   $\hat t_2$   $\hat t_3$
HT        1       1       1          95.2    95.3    94.7
GREG      0.93    0.55    0.57       95.0    94.3    95.6
LGREG     0.89    0.45    0.50       94.9    93.7    95.3


The best results were obtained by LGREG, probably due 
to the fact that the proportional frequencies of classes varied 
greatly over the range of the auxiliary variable. The 
probability of each class was such a function of the 
continuous auxiliary variable that a linear regression model 
did not fit the data well. 
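
The simulated population of this subsection can be sketched as follows. This is an illustration under stated assumptions: model (1) is taken to be a multinomial logit with class 1 as the reference class, the linear predictors are those quoted in the text, and the helper name `class_probabilities`, the seed and the single sample drawn are hypothetical. Only the Horvitz-Thompson estimates are computed.

```python
import math
import random


def class_probabilities(x):
    """Class probabilities under a three-class multinomial logit with
    linear predictors 0, 3x - 1 and -2x (class 1 is the reference)."""
    eta = (0.0, 3.0 * x - 1.0, -2.0 * x)
    shift = max(eta)                       # guard against overflow in exp
    weights = [math.exp(e - shift) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(42)
N, n = 10_000, 1_000

# Population: X uniform on (-3, 3), Y drawn from the class probabilities.
X = [rng.uniform(-3.0, 3.0) for _ in range(N)]
Y = [rng.choices((1, 2, 3), weights=class_probabilities(x))[0] for x in X]
true_freq = [Y.count(i) for i in (1, 2, 3)]

# One SRSWOR sample and the HT estimates of the class frequencies.
sample = rng.sample(range(N), n)
ht_freq = [(N / n) * sum(1 for k in sample if Y[k] == i) for i in (1, 2, 3)]
print(true_freq, ht_freq)
```

Repeating the sampling step K times and recording the estimates gives the Monte Carlo means and variances used for the design effects.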


4.3 An Experiment With the Finnish Labour Force
Survey Data 


4.3.1 Constructed Population 


We studied the estimation of the unemployment rate 
using the Finnish Labour Force Survey (LFS) data of three 
consecutive months of the year 1994. The constructed 
population consisted of 33,329 individuals. From the 
Population Register we obtained, for each population 
member, age class (15-24, 25-34, 35-44, 45-54, and 55-64 
years), sex and region (three areas). A jobseeker indicator
was obtained from the register maintained by the Ministry of
Labour, showing which individuals were registered as
unemployed jobseekers. The time lag in this administrative
data source is about two weeks. It can thus be expected that
the proportion of persons with changes in the actual labour
market status is small within this short time interval. It
should be noted that the register-based jobseeker status is
defined differently from the employment status measured in
the Labour Force Survey. The survey measurement is based
on a standard International Labour Office (ILO) definition.
All these auxiliary data were merged with the survey data
on an individual basis.

The nonresponse rate varied by jobseeker status so that 
among registered jobseekers the rate was 11.4% whereas for 
the others the rate was 7.6%. The probability of nonresponse 
was modeled by a logistic ANOVA model and the ML 
estimates of nonresponse rates (ranging from 2.9% to 22.8%) 
were used as a nonresponse model in simulations. 


For simulation experiments, we constructed an artificial
population consisting of N = 30,835 persons. Employment
status was defined by three classes: “employed”,
“unemployed”, and “not in labour force” with population
frequencies $t_1$ = 17,373, $t_2$ = 4,433, and $t_3$ = 9,029,
respectively. The unemployment rate was defined by
$R = t_2/(t_1 + t_2)$ = 20.33%. As domains we used the cells in
the crosstabulation of age classes, sex, and the register-based
unemployment status.
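
As a quick arithmetic check of the figures above (variable names are illustrative), the three class frequencies add up to the population size and the unemployment rate is the share of the unemployed in the labour force:

```python
t1, t2, t3 = 17_373, 4_433, 9_029    # employed, unemployed, not in labour force
N = t1 + t2 + t3                     # size of the artificial population
R = 100.0 * t2 / (t1 + t2)           # unemployment rate in percent
print(N, round(R, 2))                # 30835 20.33
```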

From the artificial population, K = 1,000 independent 
random samples of size n = 1,000 persons were drawn with 
simple random sampling without replacement. In each 
sample, nonresponse was simulated by the nonresponse 
model fitted to the original population. The response 
probabilities were then estimated from each sample by 
logistic regression with the same ANOVA model as in the 
nonresponse model. We multiplied each inclusion probability $\pi_k$ by
the estimated response probability.

Three models were used to compare LGREG with 
GREG. The components of $x_k$ were dummies corresponding
to age (5 classes), sex, region (3 areas) and jobseeker
status. In incomplete poststratification, or raking, a main 
effects ANOVA model was based on classified auxiliary 
variables. We compared models with and without the 
jobseeker indicator. The third model also included a fourth- 
order polynomial of age. 


4.3.2 Results 


Incorporating no auxiliary information, HT estimators
usually had larger variance than the generalized regression
estimators (Table 2). Both generalized regression estimators
based on a raking model with age, sex, and region yielded 
some improvement over the HT estimates. Much better 
results were obtained by models including the jobseeker 
indicator, which correlates more strongly (r = 0.83) with 
the ILO unemployment indicator than the other auxiliary 
variables. Thus these auxiliary data improve the efficiency 
of estimation (cf. Djerf 1997). 


Table 2
Properties of unemployment rate estimates (R(%)) for the raking
model (R) and the model including age polynomial (P), with (E)
or without (N) the jobseeker indicator. SD denotes the standard
deviation and CR (%) denotes the coverage rate of 95%
confidence intervals

Model  Method  R(%)    Bias      SD      Deff    CR     MARDE
       HT      20.32   -0.0081   1.461   1       95.5   35.28
RN     GREG    20.30   -0.0262   1.454   0.995   95.3   46.03
RN     LGREG   20.31   -0.0229   1.455   0.995   95.3   45.93
RE     GREG    20.30   -0.0244   0.895   0.612   96.0   35.74
RE     LGREG   20.29   -0.0419   0.901   0.617   94.8   34.80
PE     GREG    20.30   -0.0259   0.887   0.607   95.6   (illegible)
PE     LGREG   20.29   -0.0421   0.896   0.613   95.1   34.76


Survey Methodology, June 1998 


Table 3
Mean absolute relative domain errors (MARDE) and mean
coverage rates (CR) (%) of 95% confidence intervals
for estimated class frequencies in domains with true frequency
$t_{(d)i}$ (i = 1, 2, 3) (a) smaller than 100, and (b) at least 100.
The model included the age polynomial

                          MARDE                                    CR
     Method   $\hat t_{(d)1}$   $\hat t_{(d)2}$   $\hat t_{(d)3}$      $\hat t_{(d)1}$   $\hat t_{(d)2}$   $\hat t_{(d)3}$
(a)  GREG     96.92         67.36         (illegible)    88.2    77.8    84.6
     LGREG    80.28         67.20         104.05         83.9    76.5    (illegible)
(b)  GREG     (illegible)   (illegible)   14.35          94.1    85.9    93.8
     LGREG    (illegible)   34.50         (illegible)    93.9    85.4    (illegible)


The differences between GREG and LGREG were small 
at the population level (Table 2). LGREG was never 
inferior to GREG. Domain totals, especially in small 
domains, were more accurately estimated by LGREG than 
by GREG (Table 3). When the model included the age as a 
continuous auxiliary variable, the standard deviation of the 
unemployment rate estimate was smaller for LGREG than 
for GREG in 19 of 20 domains. Unfortunately, the 
confidence intervals obtained by LGREG were often too 
narrow due to small variance estimates (Table 3). 


5. SUMMARY 


We introduce a new approach to the model-assisted 
estimation of population class frequencies of a discrete 
response variable in survey sampling. Our logistic general- 
ized regression estimation (LGREG) is based on a multino- 
mial logistic model, which might be more realistic for class 
indicators than the linear model normally used in general- 
ized regression estimation (GREG). LGREG and GREG 
estimators yield identical results for complete poststratifi- 
cation, but differ for other models such as raking. As 
compared with GREG, LGREG usually requires more 
auxiliary information, not only the auxiliary totals. Never- 
theless, LGREG appears preferable to GREG when the 
class probabilities vary greatly over the range of continuous 
auxiliary variables and when we need estimates for small
domains, particularly in the presence of small class
frequencies.


Confidence Intervals for Domain Parameters When
the Domain Sample Size is Random


ROBERT J. CASADY, ALAN H. DORFMAN and SUOJIN WANG¹


ABSTRACT 


Let A be a population domain of interest and assume that the elements of A cannot be identified on the sampling frame and
the number of elements in A is not known. Further assume that a sample of fixed size (say n) is selected from the entire
frame and the resulting domain sample size (say $n_A$) is random. The problem addressed is the construction of a confidence
interval for a domain parameter such as the domain aggregate $T_A = \sum_{i \in A} x_i$. The usual approach to this problem is to redefine
$x_i$, by setting $x_i = 0$ if $i \notin A$. Thus, the construction of a confidence interval for the domain total is recast as the construction
of a confidence interval for a population total which can be addressed (at least asymptotically in n) by normal theory. As
an alternative, we condition on $n_A$ and construct confidence intervals which have approximately nominal coverage under
certain assumptions regarding the domain population. We evaluate the new approach empirically using artificial
populations and data from the Bureau of Labor Statistics (BLS) Occupational Compensation Survey.


KEY WORDS: Bayes method; Conditioning; Establishment surveys; Simple random sampling; Stratification; Survey
methods.


1. INTRODUCTION 


In sampling from a finite population, we often are 
interested in the estimation of totals, means, or other 
quantities, for parts of that population, usually referred to as 
domains. Such domains are not explicitly listed in the 
frame, the number of items that will occur in the survey is 
not known in advance, and often enough, we do not even 
know the number of their elements in the population. For 
example, we might sample schoolchildren for certain 
medical problems, and then wish to know the mean blood 
pressure of those children who are underweight. The class 
of underweight children would constitute a domain. The 
only information we have as to whether or not a child is 
underweight is likely to be among the sampled children; if 
so, then this would be a case where the domain is not 
explicitly listed on the frame. 

An essential part of the inference process is the estimation
of the precision of our estimators; this is typically given by 
an estimated standard deviation, coefficient of variation, or 
confidence interval. The notion of a valid confidence 
interval underlies whatever measure of precision we use. All 
confidence intervals have, by construction, a stated 
“nominal” confidence level. A valid confidence interval is 
a confidence interval with actual coverage matching the 
nominal coverage. The actual coverage may be determined 
theoretically or by empirical work mimicking the practical 
circumstances in which the confidence interval would be 
used. If a standard deviation is not such as to give rise to a
valid confidence interval, then the standard deviation needs 
to be regarded as misleading. 


In the case of estimates for domains, confidence intervals 
constructed along traditional lines can lead to serious under- 
coverage, a fact not always appreciated in the literature. 
We refer to this as the domain problem. The present paper 
addresses this problem by a somewhat complex methodolo- 
gy involving Bayesian ideas, which, however, leads to a 
rather simple practical solution, improving on current 
methodology. The main change in method lies in replacing 
the standard normal statistic used in the construction of 
confidence intervals, with a Student’s t-statistic having
degrees of freedom that depend on the number and 
configuration of the domain items in the sample. 

We shall focus on domain totals and domain means for the 
two common cases of simple random sampling and stratified 
random sampling. In the case of simple random sampling, it 
turns out that standard methods are satisfactory for the mean; 
however, for the total, coverage can be lower than nominal 
but not usually worrisome. For stratified random sampling, 
confidence intervals for both the mean and the total pose 
serious difficulties with regard to coverage level. In this case, 
the new methodology is augmented by use of a well known 
approximation due to Satterthwaite (1946). Alternate 
approaches to ours, also using this approximation, may be 
found in Johnson and Rust (1993) and Kott (1994). 

An outline of the paper is as follows: In Section 2, to 
introduce ideas, we consider the case of the total in simple 
random sampling, using it to illustrate the standard 
approach for domain estimation, the coverage problem to 
which this gives rise, and the approach here taken to rectify 
the difficulty. Section 3 describes the extension to stratified 
random sampling. Section 4 states our conclusions. 


¹ Robert J. Casady and Alan H. Dorfman, U.S. Bureau of Labor Statistics, 2 Massachusetts Ave. N.E., Washington D.C., 20212-0001, U.S.A.; Suojin Wang,
Department of Statistics, Texas A&M University, College Station, TX 77843, U.S.A.


2. THE CASE OF SIMPLE RANDOM SAMPLING


2.1 Standard Method 


The standard approach to domain estimation is well
described in Särndal, Swensson, and Wretman (1992;
Sections 3.3, 5.8, and Chapter 10) (henceforth SSW). Their
approach is general. Here we paraphrase it for the case of 
simple random sampling, and, by mild extension, for 
stratified random sampling as well, and focus on the 
domain total. 

Let $x_i$ be the value of the characteristic of interest for the
i-th (i = 1, 2, ..., N) element of the population and let A be
a domain of interest. We shall consider only the case where
the elements of A cannot be identified on the frame and the
number $N_A$ of elements in A is not known; the case where $N_A$
is known is fully treated in SSW. It is assumed that any
element of A included in a sample can be identified. The
problem is to construct a confidence interval for the domain
total, $T_A = \sum_{i \in A} x_i$, based on a sample of n elements selected
from the entire frame.

Explicitly (as in SSW, Section 3.3) or implicitly (as in
SSW, Section 10.3) the standard approach to this problem
is to redefine $x_i$, by setting $x_i = 0$ if $i \notin A$, which forces the
population total $T = \sum_{i=1}^{N} x_i$ to be equal to $T_A$. Thus, the
construction of a confidence interval for the domain total is
recast as the construction of a confidence interval for a
population total. In what follows it is assumed that the $x_i$’s
have been redefined as above. We shall also assume, here
and throughout this paper, that n is sufficiently large and
n/N sufficiently small that second order terms can be
ignored. Define the additional population parameters,


$\bar X = T/N$ = population mean,

$S^2 = \sum_{i=1}^{N} (x_i - \bar X)^2/N$ = population variance, and

$P_A = N_A/N$ = proportion of population in A.

Then

(1) $\hat T_A = (N/n)\sum_{i=1}^{n} x_i$, $\bar x = \hat T_A/N$, $s^2 = \sum_{i=1}^{n} (x_i - \bar x)^2/(n - 1)$, and $\hat p_A = n_A/n$ (where $n_A$ is the
number of sample elements in A) are unbiased for the
corresponding population parameters,

(2) $E(\hat T_A) = T_A$,

(3) $\mathrm{var}(\hat T_A) = N^2 S^2/n$,

(4) $\sqrt{n}(\hat T_A - T_A)/(NS) \xrightarrow{d} N(0, 1)$, and

(5) $s^2$ is consistent for $S^2$.

It follows that $\sqrt{n}(\hat T_A - T_A)/(Ns) \xrightarrow{d} N(0, 1)$, so, when
n is “sufficiently large”, appropriate values from the normal
distribution can be used to construct confidence intervals
for $T_A$, as noted by SSW, p. 391.
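
The standard method can be sketched in a few lines. This is a minimal illustration under the assumptions of this section (simple random sampling, n/N small enough to ignore the finite population correction); the function name `standard_domain_total_ci` and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def standard_domain_total_ci(values, in_domain, N, level=0.95):
    """Normal-theory CI for a domain total: values outside the domain are
    set to zero and the problem is treated as one for a population total."""
    n = len(values)
    y = [v if flag else 0.0 for v, flag in zip(values, in_domain)]
    y_bar = sum(y) / n
    s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
    t_hat = N * y_bar                          # estimated domain total
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * N * math.sqrt(s2 / n)     # z * sqrt(N^2 s^2 / n)
    return t_hat, (t_hat - half_width, t_hat + half_width)


# Toy sample of n = 4 from a population of N = 100; two elements are in A.
t_hat, (lower, upper) = standard_domain_total_ci(
    [10.0, 12.0, 7.0, 9.0], [True, False, True, False], N=100
)
print(t_hat, lower, upper)
```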

The proportion of the population in $A^c$ is $1 - P_A$ and
$x_i = 0$ for $i \in A^c$; therefore, when $P_A$ is small and the
values of the $x_i$’s for $i \in A$ are concentrated away from zero,
the convergence in distribution in (4) can be slow.
Consequently, the distribution of $\sqrt{n}(\hat T_A - T_A)/(Ns)$ can
deviate from normal even for what are usually considered
to be moderate to large values of n. The simulation study
in Section 2.5 illustrates this.

For the case of stratified random sampling, confidence 
interval coverage for domain quantities using standard 
methods can be poor. Dorfman and Valliant (1993) noted 
the problem in their study of wage distributions for domains 
consisting of workers in specific occupational groups. 
Preliminary empirical work by the authors indicated that
supposed 95% confidence intervals for total workers and
total wages for occupation based domains typically
provided only 75% to 85% coverage even for a large total
sample size (n = 353 establishments). These results are
verified as part of the empirical work described in
Section 3. Furthermore, their work indicated that the
distribution of $\hat T_A - T_A$ was strongly dependent on the
realized value of $n_A$, which suggested that some type of
“conditional” confidence interval should be considered. It
seems desirable to establish methodology for the construction
of conditional (on $n_A$ or equivalently $\hat p_A$) confidence
intervals for $T_A$ which provide nominal, or near nominal,
coverage regardless of the realized value of the domain
sample size. Inference conditional on sample size is
discussed in SSW, Section 10.4, but only for the case of
known $N_A$; we are concerned throughout this paper with
the case of unknown $N_A$.


2.2 Definitions and Notation 


We define the following parameters and estimators:

Domain parameters:

$\mu_A = T_A/N_A$ = domain mean,

$\sigma_A^2 = \sum_{i \in A} (x_i - \mu_A)^2/N_A$ = variance of population elements in A.

Domain estimators:

$\hat N_A = \hat p_A N$,

$\hat\mu_A = \sum_{i \in s \cap A} x_i/n_A = \hat T_A/\hat N_A$ (only defined for $n_A \ge 1$), and

$\hat\sigma_A^2 = \sum_{i \in s \cap A} (x_i - \hat\mu_A)^2/(n_A - 1)$ (only defined for $n_A \ge 2$).

In what follows it is understood that $n_A \ge 2$ (or equivalently
$\hat p_A \ge 2/n$) unless specifically stated otherwise. At $n_A$ = 1 or
0, it is preferable to supply an “insufficient information”
tag, rather than attempt inference. The relationships given
below follow directly from the definitions:

$T_A = N P_A \mu_A$ and $\hat T_A = N \hat p_A \hat\mu_A$,

$\bar X = P_A \mu_A$ and $\bar x = \hat p_A \hat\mu_A$,

$S^2 = P_A(1 - P_A)\mu_A^2 + P_A \sigma_A^2$

and

$$s^2 \approx \hat p_A(1 - \hat p_A)\hat\mu_A^2 + \hat p_A \hat\sigma_A^2. \tag{1}$$

Also, it is straightforward to verify that

$$(\sqrt{n}/N)(\hat T_A - T_A) = \sqrt{n}\,\mu_A(\hat p_A - P_A) + \sqrt{\hat p_A}\,\sigma_A Z, \tag{2}$$

where $Z = \sqrt{n \hat p_A}(\hat\mu_A - \mu_A)/\sigma_A$. Thus, conditionally on
$\hat p_A$, $\hat T_A$ is biased for $T_A$, and if, for example, we assume an
underlying normality, and standardize $(\sqrt{n}/N)(\hat T_A - T_A)$ by
the corresponding conditional variance, we will get a
non-central t-distribution with unknown non-centrality
parameter proportional to $\sqrt{n}\,\mu_A(\hat p_A - P_A)$, providing little
basis for sound (conditional) inference. This is the problem
which the discussions in the next sections attempt to
address.
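
The decomposition behind identity (2) can be checked in a few lines from the product forms of the domain total and its estimator. The sketch below assumes the notation $\hat p_A = n_A/n$ for the sample proportion, $P_A = N_A/N$ for the population proportion, $\hat T_A = N \hat p_A \hat\mu_A$ and $T_A = N P_A \mu_A$.

```latex
\begin{align*}
\frac{\sqrt{n}}{N}\,(\hat T_A - T_A)
  &= \sqrt{n}\,(\hat p_A \hat\mu_A - P_A \mu_A) \\
  &= \sqrt{n}\,\mu_A(\hat p_A - P_A) + \sqrt{n}\,\hat p_A(\hat\mu_A - \mu_A) \\
  &= \sqrt{n}\,\mu_A(\hat p_A - P_A) + \sqrt{\hat p_A}\,\sigma_A Z,
  \qquad Z = \frac{\sqrt{n \hat p_A}\,(\hat\mu_A - \mu_A)}{\sigma_A}.
\end{align*}
```

Conditional on $\hat p_A$, Z has mean zero, so the first term is the conditional bias; it vanishes only when $\hat p_A = P_A$.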

We remark that in estimating the mean $\mu_A$ by $\hat\mu_A$, the
bias is zero, and the problem of the preceding paragraph
does not arise. This is the reason that, in simple random
sampling, standard inference for means is sound, at least
when the domain variates are normally distributed.


2.3 General Methodology for Confidence Intervals 


Let $\delta = (\hat T_A - T_A)/s_p$, where $s_p^2$ is an estimator (to be
specified) of the conditional or unconditional variance of
the total. Assume that the form of the conditional (on $\hat p_A$)
distribution function of $\delta$, say $H(\cdot \,|\, \hat p_A; P_A, \mu_A, \sigma_A^2)$, is
known, where $P_A$, $\mu_A$ and $\sigma_A^2$ represent unknown parameters.
In order to construct a conditional equal tailed
$(1 - \alpha) \times 100\%$ confidence interval (CI) for $T_A$ we define
an upper critical value

$$c_U = c_U(\alpha, \hat p_A, P_A) = -\inf\{x \,|\, H(x \,|\, \hat p_A; P_A) \ge \alpha/2\} = -H^{-1}(\alpha/2 \,|\, \hat p_A; P_A),$$

where $\hat p_A$ is considered fixed and the dependence on $\mu_A$
and $\sigma_A^2$ is temporarily suppressed; a lower critical value, say
$c_L$, is defined in a similar manner. A conditional, equal
tailed $(1 - \alpha) \times 100\%$ CI for $T_A$ is then given by
$CI(1 - \alpha) = (\ell, u)$, where

$$u = \hat T_A + c_U s_p \quad \text{and} \quad \ell = \hat T_A + c_L s_p. \tag{3}$$

At this point the obvious practical problem is that the
critical values $c_U$ and $c_L$ depend not only on $\hat p_A$ but also on
the unknown parameter $P_A$. One approach to this problem
is to take a Bayesian tack and assume the parameter $P_A$ is
the realization of a random variable. Adjusting the notation
to reflect the assumption that $P_A$ is stochastic, we replace
$H(x \,|\, \hat p_A; P_A)$ by $H(x \,|\, \hat p_A, P_A)$ and have that

$$\Pr\{\delta \le x \,|\, \hat p_A\} = F(x \,|\, \hat p_A) = \int H(x \,|\, \hat p_A, P_A)\, \frac{f(\hat p_A \,|\, P_A)\, g(P_A)}{h(\hat p_A)}\, dP_A, \tag{4}$$

where $h(\hat p_A) = \int f(\hat p_A \,|\, P_A)\, g(P_A)\, dP_A$ and $g(P_A)$ is the
density of $P_A$. It should be noted that as a consequence of
our sampling scheme the distribution of $n\hat p_A$, conditional
on $P_A$, is Binomial$(n, P_A)$ so that $f(\hat p_A \,|\, P_A)$ is known.
Under the Bayesian approach, the critical values are
$c_U = c_U(\alpha, \hat p_A) = -F^{-1}(\alpha/2 \,|\, \hat p_A)$ and
$c_L = c_L(\alpha, \hat p_A) = -F^{-1}(1 - \alpha/2 \,|\, \hat p_A)$, so the upper and lower limits for a
conditional $(1 - \alpha) \times 100\%$ CI for $T_A$ are

$$u = \hat T_A + c_U s_p \quad \text{and} \quad \ell = \hat T_A + c_L s_p. \tag{5}$$

For the purposes of our current research, we assume that the
prior distribution $g(P_A)$ is $N(\mu_{P_A}, \sigma_{P_A}^2)$ with $\mu_{P_A}$ and $\sigma_{P_A}^2$
to be specified, with the understanding that $\sigma_{P_A}$ is
sufficiently small that $P_A$ lies between 0 and 1 with near
certainty. The normality assumption is made for mathematical
convenience. It also captures notions we may have
of degrees of closeness to, and symmetry about, $\mu_{P_A}$. For an
empirical Bayes approach, we use $\mu_{P_A} = \hat p_A$; we consider
several possible alternatives for $\sigma_{P_A}^2$, discussed in detail
below. Our experience indicates that the normality
assumption is not crucial; rather, it is primarily a matter of
convenience.


2.4 Confidence Intervals Under Normal 
Assumptions 


To proceed further we assume that within the domain A
the $x_i$ are distributed $N(\mu_A, \sigma_A^2)$. In practice, this
assumption may not be met. Nonetheless, it leads to
suggested modifications that will not at any rate give lower
coverage of confidence intervals than the standard
approach. Combining this assumption with earlier results,
in particular equation (2), and ignoring lower order terms,
we have

(a) $[\sqrt{n}(\hat T_A - T_A)/N \,|\, \hat p_A, P_A]$ is distributed
$N(\sqrt{n}\,\mu_A(\hat p_A - P_A),\ \hat p_A \sigma_A^2)$,

(b) $[(n\hat p_A - 1)\hat\sigma_A^2/\sigma_A^2 \,|\, \hat p_A, P_A]$ is distributed $\chi^2(n\hat p_A - 1)$, and

(c) the conditional random variable in (b) is stochastically
independent of the conditional random variable in (a).

Consider $\delta_c = (\hat T_A - T_A)/(N\hat\sigma_A\sqrt{\hat p_A}/\sqrt{n})$, which utilizes
the conditional variance of $\hat T_A$ as the standardizing term. It
follows immediately from (a), (b) and (c) that, conditional
on $(\hat p_A, P_A)$, the random variable $\delta_c$ is distributed as a
non-central t with $n\hat p_A - 1 = n_A - 1$ degrees of freedom and
non-centrality parameter

$$\Delta = \sqrt{n}\,\gamma_A(\hat p_A - P_A)/\sqrt{\hat p_A}, \quad \text{with} \quad \gamma_A = \mu_A/\sigma_A.$$

Thus, we have specified the conditional distribution
function $H(\cdot \,|\, \hat p_A, P_A)$ of $\delta_c$. As $f(\hat p_A \,|\, P_A)$ and $g(P_A)$
have been previously specified, it follows that $F(\cdot \,|\, \hat p_A)$ in
(4) is well-defined although extremely cumbersome to
calculate. The dependence on $\mu_A$ and $\sigma_A^2$, through $\gamma_A$,
should be noted.

Although $F(\cdot \,|\, \hat p_A)$ as given above can be used to
determine the critical values, they are extremely difficult to
calculate. A relatively simple approach, given in the next
paragraph, provides a close approximation to the critical
values. We have verified the closeness of the approximation
by computing the exact values for selected cases
using large scale simulations.

Adoption of a locally uniform prior on $P_A$ leads to the
approximate posterior distribution $P_A \sim N(\hat p_A, \mathrm{var}(\hat p_A))$
and we could approximate $\mathrm{var}(\hat p_A)$ by $\hat p_A(1 - \hat p_A)/n$. We
adopt the slightly more flexible prior $P_A \sim N(\mu_{P_A}, \sigma_{P_A}^2)$, and
empirically choose $\mu_{P_A} = \hat p_A$, with several possibilities for
$\sigma_{P_A}^2$ that will be specified below. It follows from Appendix
A that $[\Delta \,|\, \hat p_A]$ is distributed approximately as a normal
with mean zero and variance $\gamma_A^2(1 - \hat p_A)/(1 + \psi_A)$, where

$$\psi_A = \hat p_A(1 - \hat p_A)/(n\sigma_{P_A}^2).$$


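Then, from the result in Appendix B, conditional on $\hat p_A$,

$$\frac{\hat T_A - T_A}{N\hat\sigma_A\sqrt{\hat p_A/n}\,\sqrt{1 + \gamma_A^2(1 - \hat p_A)/(1 + \psi_A)}}$$

is distributed as a central t with $n_A - 1$ degrees of freedom.
Let $t_{1-\alpha/2,\,n_A-1}$ be the $(1 - \alpha/2)100\%$ percentile of this
distribution. The upper confidence limit u, defined in (5),
is given (approximately) by

$$u = \hat T_A + N\hat\sigma_A\sqrt{\hat p_A/n} \times \sqrt{1 + \gamma_A^2(1 - \hat p_A)/(1 + \psi_A)}\; t_{1-\alpha/2,\,n_A-1}. \tag{6}$$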


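As $\hat\sigma_A^2$ is conditionally unbiased for $\sigma_A^2$ and $\hat\mu_A^2 - \hat\sigma_A^2/n_A$
is conditionally unbiased for $\mu_A^2$, we use
$\hat\gamma_A^2 = (\hat\mu_A^2 - \hat\sigma_A^2/n_A)/\hat\sigma_A^2$ to estimate $\gamma_A^2$. Substituting $\hat\gamma_A^2$ for $\gamma_A^2$ in
(6) yields, ignoring lower order terms,

$$\hat u = \hat T_A + \frac{N}{\sqrt{n}}\sqrt{\frac{s^2 + \psi_A\,\hat p_A\hat\sigma_A^2}{1 + \psi_A}}\; t_{1-\alpha/2,\,n_A-1}, \tag{7}$$

where $s^2$ is defined in (1).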


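It remains to choose $\psi_A$. We note that $\hat u$ is strictly
decreasing as $\psi_A$ increases and

$$\hat u \to \hat T_A + \frac{Ns}{\sqrt{n}}\, t_{1-\alpha/2,\,n_A-1} = \hat u_1 \quad \text{as } \psi_A \text{ becomes small},$$

$$\hat u = \hat T_A + N\sqrt{\frac{s^2 + \hat p_A\hat\sigma_A^2}{2n}}\; t_{1-\alpha/2,\,n_A-1} = \hat u_2 \quad \text{for } \psi_A = 1,$$

and

$$\hat u \to \hat T_A + N\sqrt{\frac{\hat p_A\hat\sigma_A^2}{n}}\; t_{1-\alpha/2,\,n_A-1} = \hat u_3 \quad \text{as } \psi_A \text{ becomes large}. \tag{8}$$

In each case the lower critical value can be dealt with in an
analogous manner resulting in three competing confidence
intervals; namely, $CI_j(1 - \alpha) = (\hat\ell_j, \hat u_j)$, j = 1, 2, 3, with $\hat\ell_j$
defined similarly to $\hat u_j$ in (8) with $t_{1-\alpha/2,\,n_A-1}$ replaced by
$t_{\alpha/2,\,n_A-1}$. The competing confidence intervals are labeled in
order of decreasing length.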

The first case is equivalent to assuming that $\sigma_{P_A}^2$ is large
relative to $\mathrm{var}(\hat p_A)$ and leads to using the usual
unconditional variance but with degrees of freedom equal
to $n_A - 1$. In most practical problems this seems reasonable
since $\sigma_{P_A}^2$ is an unknown constant and $\mathrm{var}(\hat p_A)$ is $O(P_A/n)$.
The second interval corresponds to adoption of a normal
prior as noted above, with $\sigma_{P_A}^2 = \hat p_A(1 - \hat p_A)/n$. The last
confidence interval is based on the assumption that $P_A$ is
essentially degenerate at $\hat p_A$.
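
The role of $\psi_A$ can be illustrated numerically. The sketch below evaluates the approximate upper limit as a function of $\psi_A$ by plugging the sample estimates of the domain mean, the domain variance and $\gamma_A^2$ into the t-based form of equation (6). It is an illustration under assumptions, not the authors' code: the function name `domain_total_upper_limit` is hypothetical, the t-quantile is passed in by the caller (the Python standard library has no t-distribution), and truncating the estimate of $\gamma_A^2$ at zero is an added safeguard that is not in the text.

```python
import math


def domain_total_upper_limit(domain_values, n, N, psi, t_quantile):
    """Approximate upper confidence limit for a domain total under SRS.

    domain_values: sampled values that fall in the domain A (n_A of them)
    n, N: total sample size and population size
    psi: the tuning constant psi_A (0 = diffuse prior, large = degenerate)
    t_quantile: t_{1 - alpha/2, n_A - 1}, supplied by the caller
    """
    n_a = len(domain_values)
    if n_a < 2:
        raise ValueError("insufficient information: need n_A >= 2")
    p_hat = n_a / n
    mu_hat = sum(domain_values) / n_a
    sigma2_hat = sum((x - mu_hat) ** 2 for x in domain_values) / (n_a - 1)
    # Estimate of gamma_A^2; the max(..., 0) guard is an assumption.
    gamma2_hat = max(mu_hat ** 2 - sigma2_hat / n_a, 0.0) / sigma2_hat
    t_hat = N * p_hat * mu_hat
    conditional_sd = N * math.sqrt(sigma2_hat * p_hat / n)
    inflation = math.sqrt(1.0 + gamma2_hat * (1.0 - p_hat) / (1.0 + psi))
    return t_hat + conditional_sd * inflation * t_quantile


# Hypothetical data: n_A = 10 domain elements in a sample of n = 100, N = 1,000.
values = [9.0, 11.0, 10.5, 8.5, 12.0, 10.0, 9.5, 11.5, 10.0, 8.0]
t_975_9df = 2.262  # 97.5% point of Student's t with 9 degrees of freedom
u_small = domain_total_upper_limit(values, 100, 1000, 0.0, t_975_9df)
u_mid = domain_total_upper_limit(values, 100, 1000, 1.0, t_975_9df)
u_large = domain_total_upper_limit(values, 100, 1000, 1e12, t_975_9df)
print(u_small, u_mid, u_large)
```

The three values correspond, in order, to the longest, intermediate and shortest of the competing intervals, so the limit shrinks towards the strictly conditional one as $\psi_A$ grows.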


2.5 Empirical Study for SRS 


We compared the several confidence intervals of
Section 2.4 in a small empirical study, using artificial
populations, for which the domain variable was normal. In
all cases the population size N was 1,000, and the sample
size n was 100 or 300. The parameters $P_A$ and $\gamma_A$ varied
from population to population. Letting $M_2$ be the number
of runs with $n_A \ge 2$, we allowed the run size M to vary to
give $M_2$ = 10,000. Table 1 gives coverage results. $CI_0$
represents the confidence interval based on the standard
normal methodology. The results for $CI_2$ closely approximated
the results for $CI_1$ and are excluded. The value of M
is included to indicate how many trials fell into the
“insufficient information” pile, at a given setting of the
parameters. Several conclusions seem warranted:


1. Standard confidence intervals using the usual variance
estimate and normal quantiles can give low coverage.
This occurs for several values of $P_A$ when $\gamma_A = 1/2$ or
$\gamma_A = 2$; however, the under-coverage is not too severe
if the domain variable is normal. The case where
$\gamma_A = 2$ or takes even larger values is probably more
likely in practice. Thus if the domain variable is normal,
the use of standard confidence intervals under simple
random sampling is not particularly worrisome.


2. The strictly conditional intervals (i.e., $CI_3$) using the
conditional variance can give abominable coverage
when $\gamma_A$ is large. That is, confidence intervals based on
“large” values of $\psi_A$ gave very poor results.


3. The use of the standard variance estimate but replacing
the standard normal quantile with a t-quantile having
degrees of freedom based on the number of sample units
in the domain (i.e., $CI_1$) gives approximately nominal or
conservative coverage regardless of the value of $\gamma_A$.


Table 1
Coverage of 95% Confidence Intervals for Domain Total
for Artificial Populations with
Domain Variate Normally Distributed*

                                  Coverage
$P_A$    n      M        $CI_0$          $CI_1$          $CI_3$

$\gamma_A = 1/2$
.01    100    38774    100.0         100.0         91.2
       300    11773    98.3          100.0         83.2
.02    100    16327    (illegible)   99.4          95.0
       300    10078    88.6          (illegible)   93.9
.05    100    10303    88.7          97.8          93.5
       300    10000    92.3          94.4          (illegible)
.10    100    10001    90.9          94.8          (illegible)
       300    10000    94.0          95.0          92.3

$\gamma_A = 2$
.01    100    37749    (illegible)   100.0         83.5
       300    11740    94.4          100.0         89.1
.02    100    16348    99.0          100.0         88.4
       300    10075    91.4          98.9          74.7
.05    100    10312    90.5          99.5          77.6
       300    10000    93.8          95.8          66.6
.10    100    10000    91.7          96.5          67.9
       300    10000    94.0          (illegible)   65.0

* See Equation (8) and accompanying text for definition of $CI_1$
and $CI_3$. $CI_0$ is the standard normal confidence interval.


As a minor observation on the results, we note the
counter-intuitive increases in coverage for smaller $P_A$ and
n. We believe this is due to the fact that, at very small
values of $P_A$ and n, $\hat p_A$ is constrained to be positive, and so
cannot deviate much below $P_A$. Were intervals calculable
for $n_A = 0$, there would be a serious drop in coverage in
these cases. Note that the coverage rises unexpectedly only
where M is large.


3. THE CASE OF STRATIFIED RANDOM
SAMPLING


3.1 Definitions and Notation


Assume there are K strata and, where appropriate, terms
previously defined have corresponding stratum level
definitions. For example, $n_k$ is the sample size and $n_{Ak}$
is the number of sample elements in A for the k-th stratum.
Thus, a natural estimator for the domain total
$T_A = \sum_{k=1}^{K} T_{Ak}$ is

$$\hat T_A = \sum_{k \in B_1} \hat T_{Ak} = \sum_{k \in B_1} N_k \hat p_{Ak}\hat\mu_{Ak},$$

where $\hat p_{Ak} = n_{Ak}/n_k$, $\hat\mu_{Ak} = \sum_{i=1}^{n_{Ak}} x_{ki}/n_{Ak}$, and $B_1 = \{k \,|\, n_{Ak} \ge 1$
and $1 \le k \le K\}$. As $\hat p_{Ak} = 0$ for $k \notin B_1$, it is straightforward
to verify that

$$E[(\hat T_A - T_A) \,|\, \hat{\mathbf p}_A, \mathbf P_A] = \sum_{k=1}^{K} N_k(\hat p_{Ak} - P_{Ak})\mu_{Ak} = \tilde\mu_A \tag{9}$$

and

$$\mathrm{var}[(\hat T_A - T_A) \,|\, \hat{\mathbf p}_A, \mathbf P_A] = \sum_{k \in B_1} N_k^2\, \hat p_{Ak}\sigma_{Ak}^2/n_k = \sigma_c^2,$$

where $\hat{\mathbf p}_A = (\hat p_{A1}, ..., \hat p_{AK})'$ and $\mathbf P_A = (P_{A1}, ..., P_{AK})'$. Thus,
as in the simple random sampling case, there is a
conditional bias $\tilde\mu_A$ which needs to be taken into account.


3.2 A Methodology for Confidence Intervals 


The general methodology for confidence intervals of
Section 2.3 for simple random sampling holds here as well.
One need only reinterpret scalars as vectors; for example,
replace $P_A$ by $\mathbf P_A = (P_{A1}, ..., P_{AK})'$. In particular,
$H(x \,|\, \hat{\mathbf p}_A, \mathbf P_A) = \Pr\{\delta \le x \,|\, \hat{\mathbf p}_A, \mathbf P_A\}$ will be the conditional
distribution function of $\delta = (\hat T_A - T_A)/\hat\sigma$, where $\hat\sigma$ is a
re-scaling factor to be specified.

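Let $B_2 = \{k \,|\, n_{Ak} \ge 2$ and $1 \le k \le K\}$ and, for $k \in B_2$,
define $\hat\sigma_{Ak}^2 = \sum_{i=1}^{n_{Ak}} (x_{ki} - \hat\mu_{Ak})^2/(n_{Ak} - 1)$. Under normality,
$(n_{Ak} - 1)\hat\sigma_{Ak}^2/\sigma_{Ak}^2 \sim \chi^2(n_{Ak} - 1)$, so if $\{d_k \,|\, k \in B_2\}$ are
non-negative constants with $\sum_{k \in B_2} d_k > 0$, then by the usual
Satterthwaite (1946) two moment approximation, the
conditional random variable

$$\left[ \frac{1}{c} \sum_{k \in B_2} d_k (n_{Ak} - 1)\hat\sigma_{Ak}^2/\sigma_{Ak}^2 \;\Big|\; \hat{\mathbf p}_A, \mathbf P_A \right]$$

is distributed approximately as a $\chi^2(\nu)$, where

$$c = \sum_{k \in B_2} d_k^2 (n_{Ak} - 1) \Big/ \sum_{k \in B_2} d_k (n_{Ak} - 1)$$

and

$$\nu = \Big(\sum_{k \in B_2} d_k (n_{Ak} - 1)\Big)^2 \Big/ \sum_{k \in B_2} d_k^2 (n_{Ak} - 1).$$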


This suggests that we restrict our attention to expressions of
the general form

$$\hat\sigma^2 = \sum_{k \in B_2} d_k (n_{Ak} - 1)\hat\sigma_{Ak}^2/\sigma_{Ak}^2$$

with choice of the $d_k$ to be specified. Note that when
$B_1 = B_2$ and $d_k = N_k^2\, \hat p_{Ak}\sigma_{Ak}^2/(n_k(n_{Ak} - 1))$,
$\hat\sigma^2 = \hat\sigma_c^2 = \sum_{k \in B_2} d_k (n_{Ak} - 1)\hat\sigma_{Ak}^2/\sigma_{Ak}^2$ is an unbiased estimator for the
conditional variance $\sigma_c^2$. However, as in the simple random
sampling case, this estimator will tend to be too small. We
use the more general expression to develop a family of
t-statistics when we “uncondition” on $\mathbf P_A$. Each of these
will involve unknown parameters, and, as in the simple
random sampling case (transition of equation (6) to
equation (7)), estimation of these unknowns will be
necessary. Thus the net result will be several rival “near
t-statistics” which we may then compare empirically.

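Because the samples are selected independently from
each stratum we have $f(\hat p_A \mid p_A) = \prod_{k=1}^{K} f_k(\hat p_{Ak} \mid p_{Ak})$ and,
as a consequence of our within stratum sampling scheme,
$n_k \hat p_{Ak}$ has a binomial distribution $B(n_k, p_{Ak})$. We assume
that the $\{p_{Ak} \mid 1 \le k \le K\}$ are jointly independent so
$g(p_A) = \prod_{k=1}^{K} g_k(p_{Ak})$, which implies

$$f(\hat p_A \mid p_A)\, g(p_A) = \prod_{k=1}^{K} f_k(\hat p_{Ak} \mid p_{Ak})\, g_k(p_{Ak})$$

and

$$h(p_A \mid \hat p_A) = \prod_{k=1}^{K} \left[ f_k(\hat p_{Ak} \mid p_{Ak})\, g_k(p_{Ak}) \Big/ \int f_k(\hat p_{Ak} \mid p_{Ak})\, g_k(p_{Ak})\, dp_{Ak} \right].$$

In what follows, we assume that the prior distribution of $p_{Ak}$
is $N(\mu_{p_k}, \sigma_{p_k}^2)$ and, for the empirical Bayes approach, we
use $\mu_{p_k} = \hat p_{Ak}$ and, analogously to the case of simple
random sampling, we define

$$\omega_{Ak} = \big(\hat p_{Ak}(1 - \hat p_{Ak})/n_k\big) / \sigma_{p_k}^2.$$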

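It is straightforward to extend the result in Appendix A
to the case of stratified random sampling and it then follows
that, for $\tilde\mu_A$ defined by (9), $\tilde\mu_A/\tilde\sigma_A \mid \hat p_A$ is distributed
$N(0, \operatorname{var}(\tilde\mu_A \mid \hat p_A)/\tilde\sigma_A^2)$, where $\operatorname{var}(\tilde\mu_A \mid \hat p_A) = \sum_{k=1}^{K} N_k^2
\mu_{Ak}^2 \hat p_{Ak}(1 - \hat p_{Ak})/(n_k(1 + \omega_{Ak}))$. Using the result in Appendix
B, it follows that, conditional on $\hat p_A$, the random variable

$$\frac{\hat T_A - T_A}{\big[\operatorname{var}(\tilde\mu_A \mid \hat p_A) + \tilde\sigma_A^2\big]^{1/2} \big[\hat\sigma_A^2/(cv)\big]^{1/2}} = \frac{\hat T_A - T_A}{\big[\operatorname{var}(\tilde\mu_A \mid \hat p_A) + \tilde\sigma_A^2\big]^{1/2} \left[\dfrac{\sum_{k \in B_2} d_k (n_{Ak} - 1)(\hat\sigma_{Ak}^2/\sigma_{Ak}^2)}{\sum_{k \in B_2} d_k (n_{Ak} - 1)}\right]^{1/2}}$$

is distributed approximately as a central t with v degrees of
freedom.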
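Letting $\Theta = \operatorname{var}(\tilde\mu_A \mid \hat p_A) + \tilde\sigma_A^2$ with

$$\gamma_{Ak} = \mu_{Ak}^2/\sigma_{Ak}^2$$

and assuming the $\omega_{Ak}$ are near zero we have

$$\Theta \approx \sum_{k \in B_1} \frac{N_k^2 \hat p_{Ak} \sigma_{Ak}^2}{n_k} \big(\gamma_{Ak}(1 - \hat p_{Ak}) + 1\big).$$

Thus, the upper bound on the CI would be (approximately)

$$u = \hat T_A + \Theta^{1/2} \left[\frac{\sum_{k \in B_2} d_k (n_{Ak} - 1)(\hat\sigma_{Ak}^2/\sigma_{Ak}^2)}{\sum_{k \in B_2} d_k (n_{Ak} - 1)}\right]^{1/2} t_v, \qquad (10)$$

where $t_v$ stands for the critical values of the $t_v$ distribution.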
Unfortunately the bound depends not only on our choice of
the $d_k$, but also on the unknown parameters $\gamma_{Ak}$ and $\sigma_{Ak}^2$.
It is not hard to show that $v \le \sum_{k \in B_2} (n_{Ak} - 1) = v_{max}$ and,
if we set $d_k = 1$ (or any constant for that matter) then
$v = v_{max}$. We refer to $v_{max}$ specifically as the unweighted
degrees of freedom. In this case the upper bound on the CI
would be

$$u = \hat T_A + \Theta^{1/2} \left[\frac{\sum_{k \in B_2} (n_{Ak} - 1)(\hat\sigma_{Ak}^2/\sigma_{Ak}^2)}{\sum_{k \in B_2} (n_{Ak} - 1)}\right]^{1/2} t_{v_{max}}.$$

Another approach is to attempt to finesse the problem of
estimating $\Theta$ (at least when $B_1 = B_2$) by a judicious choice
of the $d_k$. To that end let us assume that $B_1 = B_2$ and let

$$d_k = \frac{N_k^2 \hat p_{Ak} \sigma_{Ak}^2 \big(\gamma_{Ak}(1 - \hat p_{Ak}) + 1\big)}{n_k (n_{Ak} - 1)}$$

so that $\sum_{k \in B_2} d_k (n_{Ak} - 1) = \Theta$ and $\Theta$ cancels out in (10). We
then have

$$u = \hat T_A + \left[\sum_{k \in B_2} \frac{N_k^2 \hat p_{Ak} \hat\sigma_{Ak}^2}{n_k} \big(\gamma_{Ak}(1 - \hat p_{Ak}) + 1\big)\right]^{1/2} t_{v_2},$$

where $v_2$ is the degrees of freedom associated with this
second choice of the $d_k$. More generally (i.e., when
$B_1 \ne B_2$), we have

$$u = \hat T_A + \Theta^{1/2} \left[\frac{\sum_{k \in B_2} N_k^2 \hat p_{Ak} \hat\sigma_{Ak}^2 \big(\gamma_{Ak}(1 - \hat p_{Ak}) + 1\big)/n_k}{\sum_{k \in B_2} N_k^2 \hat p_{Ak} \sigma_{Ak}^2 \big(\gamma_{Ak}(1 - \hat p_{Ak}) + 1\big)/n_k}\right]^{1/2} t_{v_2}.$$

In any event, we are still faced with the problem of
estimating the population parameters and we have the
additional problem of estimating the degrees of freedom.

A third possibility, which we have already mentioned, is
to let $d_k = N_k^2 \hat p_{Ak} \sigma_{Ak}^2/(n_k (n_{Ak} - 1))$, so that when $B_1 = B_2$,
$\hat\sigma_A^2 = \sum_{k \in B_2} d_k (n_{Ak} - 1)\hat\sigma_{Ak}^2/\sigma_{Ak}^2$ is a conditionally
unbiased estimator for $\tilde\sigma_A^2$. In this case we have

$$u = \hat T_A + \Theta^{1/2} \left[\frac{\sum_{k \in B_2} N_k^2 \hat p_{Ak} \hat\sigma_{Ak}^2/n_k}{\sum_{k \in B_2} N_k^2 \hat p_{Ak} \sigma_{Ak}^2/n_k}\right]^{1/2} t_{v_3},$$

where $v_3$ is the degrees of freedom associated with this
third choice of the $d_k$. As in the second case, we are faced
with the problem of estimating the population parameters
and the degrees of freedom.

Now, it should be noted that if we estimate $\sigma_{Ak}^2$ with $\hat\sigma_{Ak}^2$
for $k \in B_2$ and let $\hat\Theta$ be a yet to be specified estimator of $\Theta$
then the (estimated) upper bounds above are $u = \hat T_A +
\hat\Theta^{1/2} t_{v_{max}}$, $u = \hat T_A + \hat\Theta^{1/2} t_{\hat v_2}$ and $u = \hat T_A + \hat\Theta^{1/2} t_{\hat v_3}$.
The degrees of freedom are estimated by substituting
estimates of the population parameters into the two
respective choices of the $d_k$. Both $\hat v_2$ and $\hat v_3$ are smaller
than $v_{max}$, so, for any fixed value of $\hat\Theta$, the confidence
interval using $v_{max}$ will be the shortest. There is no general
relationship between the sizes of $\hat v_2$ and $\hat v_3$. Empirical
evidence indicates that there is little to choose between the
second and third approach.

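Addressing the problem of estimating $\Theta$, we can write

$$\Theta \approx \sum_{k \in B_1 - B_2} N_k^2 \hat p_{Ak} \big(\mu_{Ak}^2 (1 - \hat p_{Ak}) + \sigma_{Ak}^2\big)/n_k + \sum_{k \in B_2} N_k^2 \hat p_{Ak} \big(\mu_{Ak}^2 (1 - \hat p_{Ak}) + \sigma_{Ak}^2\big)/n_k.$$

For $k \in B_1 - B_2$ the estimator $\hat\sigma_{Ak}^2$ is not defined, however,
it is straightforward to verify that $(1 - \hat p_{Ak}) E[\hat\mu_{Ak}^2 \mid \hat p_{Ak}] \le
\mu_{Ak}^2 (1 - \hat p_{Ak}) + \sigma_{Ak}^2 \le E[\hat\mu_{Ak}^2 \mid \hat p_{Ak}]$. It follows that

$$s_L^2 = \sum_{k \in B_1} N_k^2 \hat p_{Ak}(1 - \hat p_{Ak}) \hat\mu_{Ak}^2/n_k + \sum_{k \in B_2} N_k^2 \hat p_{Ak} \hat\sigma_{Ak}^2 (1 + 1/n_k - 1/n_{Ak})/n_k$$

will tend to underestimate $\Theta$, and

$$s_U^2 = \sum_{k \in B_1 - B_2} N_k^2 \hat p_{Ak}^2 \hat\mu_{Ak}^2/n_k + \sum_{k \in B_1} N_k^2 \hat p_{Ak}(1 - \hat p_{Ak}) \hat\mu_{Ak}^2/n_k + \sum_{k \in B_2} N_k^2 \hat p_{Ak} \hat\sigma_{Ak}^2 (1 + 1/n_k - 1/n_{Ak})/n_k$$

will tend to overestimate $\Theta$. Clearly, $s_L^2 \le s_U^2$ with equality
only when $B_1 = B_2$.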

It can also be verified that in the case of stratified
sampling, the standard variance estimator for estimated
population totals is

$$s_{st}^2 = \sum_{k \in B_1} N_k^2 s_k^2/n_k = \sum_{k \in B_1} N_k^2 \hat p_{Ak}(1 - \hat p_{Ak}) \hat\mu_{Ak}^2/(n_k - 1) + \sum_{k \in B_2} N_k^2 \hat p_{Ak} \hat\sigma_{Ak}^2 (1 - 1/n_{Ak})/(n_k - 1).$$

This looks like a satisfactory estimator of $\Theta$, if the $n_k$ are
not small.

These results imply that CIs of the form $(\hat T_A \pm s_U t_{\alpha/2, \hat v_2})$
will provide the highest level of coverage; but CIs of the
form $(\hat T_A \pm s_{st} t_{\alpha/2, \hat v_2})$ and even perhaps $(\hat T_A \pm s_{st} t_{\alpha/2, v_{max}})$
have obvious computational advantages. Several of these
competing forms of CI are evaluated empirically in Section
3.3. These results can easily be extended to ratio estimators by
the standard linearization approach.
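
As an illustration of the computationally simplest of these intervals, the sketch below computes the estimated domain total, the standard stratified variance estimator, the unweighted degrees of freedom, and the resulting t-based interval. It is a minimal Python sketch under stated assumptions: the input layout (a list of dictionaries with keys `N` and `sample`), the function name `domain_total_interval`, and the omission of finite population corrections are all illustrative choices, and the t critical value is supplied by the caller rather than computed.

```python
import math
from statistics import mean, variance


def domain_total_interval(strata, t_crit):
    """t-based interval for a domain total under stratified sampling.

    strata : list of dicts (hypothetical layout) with keys
             'N'      -> stratum population size N_k
             'sample' -> list of (x, in_domain) pairs for the n_k sampled units
    t_crit : critical value chosen by the caller for the degrees of freedom
    Returns (t_hat, s2_st, v_max, (lower, upper)).
    """
    t_hat, s2_st, v_max = 0.0, 0.0, 0
    for st in strata:
        # Units outside the domain contribute zero to the domain total.
        y = [x if in_domain else 0.0 for x, in_domain in st["sample"]]
        n_k = len(y)
        n_ak = sum(1 for _, in_domain in st["sample"] if in_domain)
        t_hat += st["N"] * mean(y)                    # N_k * p_hat * mu_hat
        s2_st += st["N"] ** 2 * variance(y) / n_k     # no fpc in this sketch
        v_max += max(n_ak - 1, 0)                     # strata with n_Ak >= 2
    half = t_crit * math.sqrt(s2_st)
    return t_hat, s2_st, v_max, (t_hat - half, t_hat + half)


example = [
    {"N": 10, "sample": [(2.0, True), (7.0, False), (4.0, True), (1.0, False)]},
    {"N": 20, "sample": [(1.0, True), (3.0, True), (5.0, True), (9.0, False)]},
]
t_hat, s2_st, v_max, ci = domain_total_interval(example, t_crit=3.182)
```

In the example the unweighted degrees of freedom are 3, so the caller passes the 97.5% point of the t distribution with 3 degrees of freedom (about 3.182) instead of 1.96.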


3.3. Empirical Investigation for Stratified Random 
Sampling: the BLS Wage Data 


With a view to improving estimation of precision on wage 
data produced by the U.S. Bureau of Labor Statistics, we 
investigated coverage and interval length in two simulation 
studies on populations constructed from a test sample of the 
Occupational Compensation Survey Program (OCSP) 
conducted in 1991. The OCSP consisted of establishment 
surveys in several metropolitan areas, aimed at estimating 
wage levels for a select group of occupations. The surveys
were carried out by stratified simple random sampling, with
establishments stratified by employment size and industrial 
classification. 

One population (the “Small Population”) took the test 
sample itself as the population, with six non-certainty strata, 
and one certainty stratum of 12 establishments. Five hundred 
stratified random samples were taken from this population 
at sizes n = 36 and 60, corresponding to the choices $n_k = 4$
and $n_k = 8$, reflecting relative sample sizes of sampling
from the original population. The second population (the
“Large Population”) was constructed by expanding the
sample data through replication (by simple random sampling
with replacement, within each Small Population stratum) of
establishments to achieve a population the size of the original
population; again there were six noncertainty strata and one
certainty stratum; for each stratum sample sizes were the same as in the
actual sample. Domains are defined by the different occupations
of interest; only a fraction of establishments have
workers in a particular occupation, and lie in the corresponding
domain. Table 2 gives the number of establishments
having workers in the selected occupations for the small
population.

In both cases sampling was without replacement, so 
finite population correction factors were included (as 
appropriate) in the construction of the CIs. Also, the study 
was limited to a concern with 95% coverage.

Table 2
Number of Establishments in Given Domain (Occupation), 
by Stratum for Small Population 


stratum 
Occupation 1 ers} A ed a OS PRT. total 
4021 Oe 45 al Ole Se Oman), 50 
1141 Op Sr al i Wb ew 48 
1122 Oe ese meatiorelley aks de 56 
3180 LOW nec Ot come 80 
2911 ORR eS CAE 213 Oe 17m ae 56 
1142 DRS HSNO MS sO qi) 
1180 Le 2O ES Th 6 le BS eae 1 138 
1403 12 OWN eee 8 eee ieee 139 
All Estabs ey Gy SEP shy or stay 353 


Small Population: Table 3 gives coverage and median
relative interval length for total wages, at two sample sizes $n_k = 4$
and $n_k = 8$, for 8 occupations, and three methods of confidence
interval construction: the standard variance estimator,
$s_{st}$, with the standard normal z-quantile, the unweighted
degrees of freedom $v_{max}$, and the weighted degrees of
freedom $\hat v_2$. Occupations are ordered by increasing values
of the average value, over runs, of the unweighted degrees
of freedom. We note:

1) Almost universally, coverage using the standard variance
estimator and the standard normal quantiles
(infinite df) is poor.

2) Coverage for the other interval types is far more
satisfactory. In general, the coverage is near the nominal
95%, or slightly conservative, for weighted degrees of
freedom; as expected, intervals based on unweighted
degrees of freedom tend to yield coverage a few points
below those based on weighted degrees of freedom.

3) Two occupations (1122, 4021) yield seriously low
coverage for totals even with the improved procedures.
Investigation of these particular occupations suggests
a strong violation of the normality assumption. In 4021,
for example, two units in stratum 5 have a number of
workers, and hence total wages, an order of magnitude
higher than the other establishments in this stratum and
indeed in the population. Furthermore, the wage rate of
these two outliers is markedly lower than the great bulk
of establishments: with just these two excluded from
the population, the overall population average wage
would be $9.68/hour; with them in, it is $8.28. Since
there are 66 establishments in stratum 5, it is easy for
these two establishments to escape being in a sample of
size 8; the consequence is a serious overestimate of the
mean wage or underestimate of total wage. At the
same time, wages for the establishments that are in the
sample are relatively homogeneous, so the variance
estimate will tend to be too low. The presence of
several smaller establishments in the domain contributes
to enlarging the degrees of freedom, and so the
t-adjustment is unable to compensate fully. It is hard to
see how to guard against such a problem short of
having prior information, and allotting such outliers to
a certainty stratum. Even so, the adjusted intervals are
a significant improvement on the naive normal
distribution based interval.

Interval lengths are taken relative to $2 \times z_{.975} \approx 4$ times
the root mean square error of $\hat T_A$, calculated over runs.
We report the median of these standardized lengths
(across runs). When the distribution of $\hat T_A$ is actually
normal, the median length is close to 1.


Table 3
Estimated degrees of freedom, coverage, and relative median length of CIs for total wages of workers in occupation,
for the small population

Occupation 4021 1141


Four Sample Establishments Per Stratum 


1122 3180 2911 1142 1180 


Coverage 


Eight Sample Establishments Per Stratum 
1141 4021 1122 3180 2911 1142 1180 1403 
Bi 3.009 Oe nO Ono: Omen le2 a LOT 


20S PI a at oh) SS ree GMM SS! Une Of 


74 Bh AG” ORGS 7D. © TS” RG e” a See eo? 
‘Sie 65 See So bn sbu, 90 0 90 meron 


Ql 5) 4 ne >) 1 © OM) © EO 


Median Relative Length 


df =v... LSI PEST” 1.609 ty, Ppenesoigat 43 
df=, Ne Ra FI RP SN Sa RNG RPE) 
pers CY er eed gay] Son eae ees 
T,*soat, 89) )y292.1901i.03\ mn. OOLENF OS: ke a OGr ARID 
Deeks. 02 03 95 OU 06 0b. 208 
T Pes. 2 0.53 0.75 0.59 0.70 0.74 0.85 0.90 
T,+5,,t, 2.65) 3:61 2.30i952.60 02120 (4.198 1551.50 
Psst; 3730 4.32 seolo) mS 40S am DO az70 


0.87 0.63 0.66 0.80 0.83 0.88 0.92 0.96 
1O3ie OO Real Sie LOL 1 One| OG maple O 2 aale Oe: 


SiLOfs AK), DES PAOD TIES dl aks} Tyee The NG


4) The relative interval length of the standard interval
tends to be too small, that is, it tends to be less than 1.

5) Interval length among the other variance-degrees of
freedom combinations is largest for $s_{st}$ with $\hat v_2$, and
smallest for $s_{st}$ with $v_{max}$. These differences can be
appreciable; there is a tradeoff between coverage and
interval size.

6) For a given interval type, the relative interval length
tends to 1 as $v_{max}$ increases.

The conclusions from a study of mean wages are similar.
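
The two summary measures reported in the tables, coverage and median relative interval length, can be computed from simulation output along the following lines. This is a minimal Python sketch: the function name `summarize_intervals` and the input layout (one interval and one point estimate per run) are assumptions made for illustration, and the standardization by 2 x 1.96 times the root mean square error follows the description given in the text.

```python
import math
from statistics import median


def summarize_intervals(intervals, estimates, true_total, z=1.96):
    """Coverage and median relative length over simulation runs.

    intervals  : list of (lower, upper) pairs, one per run
    estimates  : list of point estimates, one per run
    true_total : known population value for the domain
    z          : normal quantile used for standardization
    """
    covered = sum(1 for lo, hi in intervals if lo <= true_total <= hi)
    # Root mean square error of the point estimator over runs.
    rmse = math.sqrt(sum((e - true_total) ** 2 for e in estimates) / len(estimates))
    # Each length is divided by the length of a normal interval built on the RMSE.
    rel_lengths = [(hi - lo) / (2 * z * rmse) for lo, hi in intervals]
    return covered / len(intervals), median(rel_lengths)


coverage, rel_len = summarize_intervals(
    intervals=[(8.0, 12.0), (9.0, 13.0), (11.0, 15.0)],
    estimates=[10.0, 11.0, 13.0],
    true_total=10.5,
)
```

A median relative length below 1 indicates intervals shorter than a normal interval based on the true root mean square error, which is the pattern the standard interval shows in the tables.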


Large Population: Table 4 gives coverage and interval
length for total wages for five interval types, and a wider
range of occupations, ordered by average $v_{max}$. The
interval types include the three used previously for the small
population. The two new intervals utilize the weighted
degrees of freedom together with $s_L$ and $s_U$ respectively.
Results are based on 5,000 runs.


1) The results are consistent with those for the Small 
Population, in terms of the relative coverage and interval 
sizes of the several interval types. The standard normal 
is unsatisfactory for many occupations. 


2) The coverage for intervals using the weighted degrees of
freedom, $\hat v_2$, is less than 90% for only a small fraction
of cases.

3) There can be marked differences in interval length for
the different interval types; however, all ratios of
interval length to 4 x root mean square error tend to 1,
as $v_{max}$ gets large.

4) Little difference results from using $s_L$, $s_U$, or $s_{st}$ with $t_{\hat v_2}$.


Again, the results for mean wages, while differing in detail, 
lead to the same overall conclusions, and are omitted. 


4. SUMMARY AND CONCLUSIONS 


From our theoretical investigation and simulation work, 
we draw the following conclusions: 

1. Standard 95% confidence intervals for domain means or 
totals, when based on the standard normal distribution and 
standard methods of variance estimation, tend to yield less 
than actual 95% coverage. The extent of the deviation will 
vary with domain (occupation in the wage study), but can 
be quite considerable even when the sample size is large. 

2. New nonstandard methods offer a sharp improvement, 
giving intervals with better coverage, typically at or 
close to the nominal 95% coverage. These intervals tend 
to be longer than the standard intervals. The increase in 
length will vary with domain, and will depend on the 
particular method for CI construction that is adopted. 


Table 4 
Estimated degrees of freedom, coverage, and relative median length of CIs for total wages of workers in occupation,
for the large population 


Occupation 


1718 1604 1802 1716 2911 2052 1332 1141 4021 1232 2853 3020 1122 1142 1714 1514 3180 4030 1063 1403 1180 


df = V wax 29795 3:45 4144 11.9) 12.4 13:10 15.3, 16.9 16:8 17:3 20:6 24:9 28:0 28.6 29.1 34:8 41.5 59.9 77.6 77.9 128 
df=, IMO Peshe) esks) SEH) IED. 495) Milt) QTD sy). sys) iekey WOES iG) Ceows Ieys) IMO) Pe Rey Ae Oto CLOG 
Coverage 
T 45,42 de) 450) els) Stale) 098 BSS) OD OD GD 88. ak) eb Oe ee Bi ee ee ee 
pea (OOO Seo 4 9) Oo le. O55 Eo le 945 294g OSM SSh 90% 786s 95092, a Se 95 Nt 94 FOS 
jes, Oi ats Sub Sil SS) i I EO) CH = helo RL CB EP by VEE ee 
Tisai, SY ak) Se Se 20 Sy SO Ol Me ME Se oo LO) Wil ay eb Ge bey Gs) Ley 
T, 25, (7 OOM LOm -O2iero Ober Ov 9 ONE OO mo 95" B94 OSuer SO mee 1 Sie 795m) 7.93: aeS3e 895 e94 Ah1.95 
Median Relative Length 
PS cuq2 0.99 0.78 0.92 0.97 0.95 0.96 0.99 0.98 0.96 0.97 0.98 .98 0.95 0.96 0.93 0.98 1.00 0.91 1.00 1.00 1.01 
ae aah 2.14 1.47 1.40 1.08 1.06 1.06 1.08 1.06 1.04 1.04 1.04 1.03 0.99 1.00 0.98 1.01 1.03 0.93 1.01 1.01 1.02

For domains which yield large samples, there will be
little difference from standard intervals.


3. The instances where coverage fell below nominal, even
using the t-adjusted intervals, may be ascribed to severe
violation of the normality assumption for the domain data.
Thus the t-adjustment is not a cure-all. Nonetheless, even
in such cases there is a good deal of improvement in
coverage over the use of the standard normal interval.


4. The key idea behind these intervals is to condition on 
the amount of information on the particular occupation, 
which, roughly speaking, is measured in terms of the 
number of units in the sample that belong to the domain. 
The fraction of such units within each stratum is 
unknown, and to handle this fact we put a prior 
distribution on this unknown, reflective of the degree of 
our ignorance of it, an idea we borrow from the 
Bayesians. However, in the final analysis, it is the 
realized coverage probabilities that determine the merit 
of the approach. 


5. The principal effect of these ideas is the abandonment, 
for purposes of CI construction, of the standard normal 
quantiles (± 1.96 for 95% coverage). These are replaced
by quantiles from the Student’s t-distribution,
with degrees of freedom determined from the sample 
and varying with domain. If because of publication 
requirements or for other reasons, there is need to report 
standard deviations rather than confidence intervals, 
then we recommend reporting an effective standard 
deviation given by the length of the proposed t-based 
95% confidence interval divided by twice 1.96. 


6. The standard estimate of variance seems acceptable for 
estimating the variance, when accompanying the new 
t-quantile. In most instances this combination should be
quite satisfactory, so that the only change from standard 
methodology will be the introduction of adjusted 
degrees of freedom. However, in some instances, the 
alternative standard deviations may improve coverage or 
reduce the length of confidence intervals. 


7. An open question concerns what degree and type of
collapsing of strata (if any) should be used in the
estimation of variances and of the degrees of freedom
for the purpose of confidence interval construction. In
general, there will be a tradeoff: as strata are reduced in
number, the estimate of variance will tend to increase,
but so will the degrees of freedom (reducing the size of
$t_{v_{max}}$ or $t_{\hat v_2}$). The answer to this question may be
population specific, and experience from past surveys
may prove useful.
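
The effective standard deviation recommended in conclusion 5 is a one-line computation. The sketch below is illustrative only; the function name `effective_sd` is an assumption, and the interval endpoints are taken as given from whichever t-based method is adopted.

```python
def effective_sd(lower, upper, z=1.96):
    """Length of a t-based 95% interval divided by twice the normal quantile."""
    return (upper - lower) / (2 * z)


# Hypothetical interval: estimate 10, standard error 2, t critical value 3.182.
sd_eff = effective_sd(10 - 3.182 * 2, 10 + 3.182 * 2)
```

A user who then forms the usual interval of plus or minus 1.96 effective standard deviations recovers the t-based interval, which is the point of the recommendation.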
APPENDIX A

From the discussion in Section 2.2 we know that $n\hat p_A$
has a binomial distribution $\mathrm{Bin}(n, p_A)$, hence, for $\hat p_A = 0,
1/n, 2/n, \dots, 1$,

$$f(\hat p_A \mid p_A) = \frac{\Gamma(n + 1)}{\Gamma(n\hat p_A + 1)\,\Gamma(n(1 - \hat p_A) + 1)}\, p_A^{n\hat p_A} (1 - p_A)^{n(1 - \hat p_A)}$$

$$= \frac{1}{n + 1} \cdot \frac{\Gamma(n + 2)}{\Gamma(n\hat p_A + 1)\,\Gamma(n(1 - \hat p_A) + 1)}\, p_A^{(n\hat p_A + 1) - 1} (1 - p_A)^{(n(1 - \hat p_A) + 1) - 1} = k_{\hat p_A}(p_A)/(n + 1).$$

For each (fixed) value of $\hat p_A$, the function $k_{\hat p_A}(p_A)$ is the
pdf of a Beta distribution with parameters $\alpha_1 = n\hat p_A + 1$ and
$\alpha_2 = n(1 - \hat p_A) + 1$. As both $\alpha_1$ and $\alpha_2$ will be larger than
unity with high probability (at least in most real world
situations), it is reasonable to approximate $k_{\hat p_A}(p_A)$ with a
normal pdf having equivalent mean and variance, which are
approximately $\hat p_A$ and $\hat p_A(1 - \hat p_A)/n$ respectively.
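
The quality of this approximation can be checked by comparing the exact Beta moments with the approximate values. The following minimal Python sketch uses the standard formulas for the mean and variance of a Beta distribution; the function name `beta_moments` is an illustrative assumption.

```python
def beta_moments(n, p_hat):
    """Exact mean and variance of Beta(n*p_hat + 1, n*(1 - p_hat) + 1)."""
    a1 = n * p_hat + 1
    a2 = n * (1 - p_hat) + 1
    total = a1 + a2                      # equals n + 2
    mean = a1 / total
    var = a1 * a2 / (total ** 2 * (total + 1))
    return mean, var


n, p_hat = 100, 0.3
exact_mean, exact_var = beta_moments(n, p_hat)
approx_mean, approx_var = p_hat, p_hat * (1 - p_hat) / n
```

For n = 100 and a sample proportion of 0.3 the exact mean is 31/102 and the exact variance is close to 0.00205, against approximate values of 0.3 and 0.0021, so the two agree to within a few percent.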

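Assuming that $p_A \sim N(\mu, \sigma^2)$, it follows that the
posterior distribution is

$$h(p_A \mid \hat p_A) = f(\hat p_A \mid p_A)\, g(p_A) \Big/ \int f(\hat p_A \mid p_A)\, g(p_A)\, dp_A \approx c \exp\left\{ -\frac{1}{2} \left[ \frac{(p_A - \hat p_A)^2}{\hat p_A(1 - \hat p_A)/n} + \frac{(p_A - \mu)^2}{\sigma^2} \right] \right\},$$

where c is the normalizing constant.
Under the “empirical Bayes” assumption that $\mu = \hat p_A$ and
$\sigma^2 = \hat p_A(1 - \hat p_A)/n$ we have

$$h(p_A \mid \hat p_A) \approx \frac{1}{\sqrt{2\pi}\sqrt{\hat p_A(1 - \hat p_A)/2n}} \exp\left\{ -\frac{1}{2} \cdot \frac{(p_A - \hat p_A)^2}{\hat p_A(1 - \hat p_A)/2n} \right\}.$$

If we drop the specific assumption regarding $\sigma^2$, and let
$w = (\hat p_A(1 - \hat p_A)/n)/\sigma^2$ then $[p_A \mid \hat p_A] \sim N(\hat p_A, \hat p_A(1 -
\hat p_A)/((1 + w)n))$.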
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APPENDIX B 


Result: Assume W is distributed $N(0, c^2)$ and, conditional
on W = w, the random variable T is distributed as a
non-central t with v degrees of freedom and non-centrality
parameter w. Then, the unconditional distribution of
$T/\sqrt{c^2 + 1}$ is central t with v degrees of freedom.


Proof: First notice that T can be written as $T = (X +
W)/\sqrt{S^2/v}$, where X is distributed as N(0, 1), $S^2$ is distributed
as $\chi^2_v$, and X, W, and $S^2$ are mutually independent.
Therefore, $X' = (X + W)/\sqrt{1 + c^2}$ is distributed as N(0, 1).
As $X'$ and $S^2$ are independent, it follows by definition that
$T' = T/\sqrt{1 + c^2} = X'/\sqrt{S^2/v}$ is distributed as $t_v$.
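
The result can be checked by simulation. The sketch below draws W, X and the chi-square variable directly from the representation used in the proof and rescales T. It is a minimal illustration in standard-library Python; the function name `simulate_scaled_t` and the parameter values are assumptions chosen for the example, and the check relies on the known variance v/(v - 2) of a central t distribution with v > 2 degrees of freedom.

```python
import math
import random


def simulate_scaled_t(v, c, n_draws, seed=0):
    """Draw T / sqrt(1 + c^2), with T = (X + W) / sqrt(S2 / v)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        w = rng.gauss(0.0, c)                  # W ~ N(0, c^2)
        x = rng.gauss(0.0, 1.0)                # X ~ N(0, 1)
        s2 = rng.gammavariate(v / 2.0, 2.0)    # chi-square with v df
        t = (x + w) / math.sqrt(s2 / v)        # non-central t given W = w
        draws.append(t / math.sqrt(1.0 + c * c))
    return draws


draws = simulate_scaled_t(v=10, c=1.5, n_draws=100_000)
m = sum(draws) / len(draws)
sample_var = sum((d - m) ** 2 for d in draws) / len(draws)
# For v = 10 a central t has variance 10 / 8 = 1.25.
```

Agreement of the sample variance with v/(v - 2) is only a partial check of the distributional claim, but it fails visibly if the rescaling factor is omitted.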
On Regression Estimation of Finite Population Means 


GIORGIO E. MONTANARI’ 


ABSTRACT 


This paper examines the main properties of the generalized regression estimator of a finite population mean and those of 
the regression estimator obtained from the optimal difference estimator. Given that the latter can be more efficient than the 
former, conditions allowing this to happen are established, and a criterion for choosing between the two types of regression 
estimators follows. A simulation study illustrates their finite sample performances. 


KEY WORDS: Generalized regression estimator; Difference estimator; Auxiliary information. 


1. INTRODUCTION 


Regression estimation is an effective technique for 
estimating survey variable finite population means or totals 
when the population means or totals of a set of auxiliary 
variables are known. The problem can be stated as follows. 
Consider a finite population $U = \{a_1, a_2, \dots, a_N\}$ consisting
of N units labelled 1, 2, ..., N. Let $Y_i$ be the value for unit $a_i$
of a survey variable y whose population mean $\bar Y = \sum_{i=1}^{N} Y_i/N$
has to be estimated by means of a sample drawn from $U$.
To this end let us suppose that the population mean
$\bar X = \sum_{i=1}^{N} x_i/N$ of a q-dimensional auxiliary variable vector,
having $x_i = (x_{i1}, x_{i2}, \dots, x_{iq})'$ as its value for unit $a_i$, is
known, for example from administrative registers or a
census. The entries of $x_i$ can be quantitative as well as
indicator variables denoting the membership of the unit to
given subpopulations. Let s be the set of sample unit labels
obtained from a sampling design having first order
inclusion probabilities $\pi_i$, i = 1, 2, ..., N, strictly positive.
Then, a regression estimator can be written as follows

$$\hat{\bar Y}_R = \hat{\bar Y} + (\bar X - \hat{\bar X})' \hat\beta, \qquad (1)$$

where $\hat{\bar Y} = \sum_{i \in s} Y_i/(N\pi_i)$ and $\hat{\bar X} = \sum_{i \in s} x_i/(N\pi_i)$ are the Horvitz-
Thompson unbiased estimators of $\bar Y$ and $\bar X$, respectively,
and $\hat\beta$ is a vector of regression coefficients, given by some
function of sample data $\{(Y_i, x_i'), i \in s\}$. Briefly, $\hat{\bar Y}_R$ is
obtained by adding to the unbiased estimator $\hat{\bar Y}$ terms
proportional to the difference between the population means of
the auxiliary variables, $\bar X_k = \sum_{i=1}^{N} x_{ik}/N$, k = 1, ..., q, and
the corresponding estimates $\hat{\bar X}_k = \sum_{i \in s} x_{ik}/(N\pi_i)$.

This paper discusses the two chief methods of
constructing the vector $\hat\beta$ and the properties of the
corresponding regression estimators. A criterion based on 
a first order approximation analysis is then given for 
selecting one of the two alternatives. Finally, the results of 
two empirical studies, carried out to explore the finite 


sample performances of the examined estimators, are 
reported. All unsubscripted expectations and variances are 
taken with respect to a sample design. When calculations 
are made with respect to a model, a subscript m will be 
used. 


2. MAIN PROPERTIES OF THE REGRESSION 
ESTIMATOR 


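Mild restrictions on the second order inclusion probabilities
of the sampling design and on the limiting population
moments of $Y_i$ and $x_i$ are sufficient to ensure that
the estimator $\hat{\bar Y}_R$ can be approximated by the difference
estimator

$$\tilde{\bar Y}_R = \hat{\bar Y} + (\bar X - \hat{\bar X})' \beta, \qquad (2)$$

where $\beta$ is the limit in probability of the vector $\hat\beta$, when
both the sample size and the population size go to infinity,
and the limit is defined as in Isaki and Fuller (1992); Wright
(1983); Montanari (1987). Then, the large sample performance
of the regression estimator can be studied by means
of its linear approximation (2). As a consequence, the
regression estimator $\hat{\bar Y}_R$ is approximately unbiased, because
$\tilde{\bar Y}_R$ is unbiased. The sampling variance of $\hat{\bar Y}_R$ can be
approximated by that of $\tilde{\bar Y}_R$, given by

$$V(\tilde{\bar Y}_R) = V(\hat{\bar Y}) + \beta' V(\hat{\bar X}) \beta - 2\beta' C(\hat{\bar X}, \hat{\bar Y}), \qquad (3)$$

where $V(\hat{\bar Y})$ is the variance of $\hat{\bar Y}$, $V(\hat{\bar X})$ is the q x q
dimensional variance matrix of $\hat{\bar X}$, and $C(\hat{\bar X}, \hat{\bar Y})$ is the q
dimensional covariance vector between $\hat{\bar X}$ and $\hat{\bar Y}$. Since
$\tilde{\bar Y}_R$ can be rewritten

$$\tilde{\bar Y}_R = \bar X' \beta + \sum_{i \in s} \frac{U_i}{N\pi_i},$$

where $U_i = Y_i - x_i' \beta$, then

$$V(\tilde{\bar Y}_R) = \sum_{i=1}^{N} U_i^2 \frac{1 - \pi_i}{N^2 \pi_i} + \sum_{i=1}^{N} \sum_{j \ne i} U_i U_j \frac{\pi_{ij} - \pi_i \pi_j}{N^2 \pi_i \pi_j}.$$

An approximately unbiased estimator of $V(\tilde{\bar Y}_R)$ is given by
the Horvitz-Thompson formula

$$\hat V(\hat{\bar Y}_R) = \sum_{i \in s} \hat U_i^2 \frac{1 - \pi_i}{N^2 \pi_i^2} + \sum_{i \in s} \sum_{j \ne i} \hat U_i \hat U_j \frac{\pi_{ij} - \pi_i \pi_j}{N^2 \pi_i \pi_j \pi_{ij}},$$

where $\hat U_i = Y_i - x_i' \hat\beta$. Alternatively, when the sample size
is fixed, the Yates-Grundy variance estimator is available,
i.e.

$$\hat V(\hat{\bar Y}_R) = \frac{1}{2} \sum_{i \in s} \sum_{j \ne i} \frac{\pi_i \pi_j - \pi_{ij}}{N^2 \pi_{ij}} \left( \frac{\hat U_i}{\pi_i} - \frac{\hat U_j}{\pi_j} \right)^2.$$

Henceforth $V(\tilde{\bar Y}_R)$ will be called the asymptotic variance of
$\hat{\bar Y}_R$.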

3. THE GENERALIZED REGRESSION 
ESTIMATOR 


Two methods are generally used for constructing the
vector $\hat\beta$. The first one has been developed within the
framework of the model assisted approach to survey
sampling inference, as it is described in Särndal, Swensson
and Wretman (1992; sec. 6.4) and Estevao, Hidiroglou and
Särndal (1995). Letting $Y_i$ be either a random variable or
an observation of it, consider the following linear
regression superpopulation model

$$E_m(Y_i) = x_i' \beta, \quad V_m(Y_i) = \sigma^2 v_i, \quad C_m(Y_i, Y_j) = 0, \; i \ne j; \; i, j = 1, \dots, N, \qquad (4)$$

where $E_m$, $V_m$, and $C_m$ denote expected value, variance and
covariance with respect to the model; $\beta$ and $\sigma^2$ are
unknown model parameters; $v_i$ is a known function of $x_i$.
The vector

$$B = \left( \sum_{i=1}^{N} \frac{x_i x_i'}{v_i} \right)^{-1} \sum_{i=1}^{N} \frac{x_i Y_i}{v_i}$$

is the census least squares estimator of $\beta$. Under general
conditions, such as those quoted in the referenced papers,

$$\hat B = \left( \sum_{i \in s} \frac{x_i x_i'}{v_i \pi_i} \right)^{-1} \sum_{i \in s} \frac{x_i Y_i}{v_i \pi_i} \qquad (5)$$

is a consistent estimator of $B$, and when replaced in (1)
gives the generalized regression (GREG) estimator

$$\hat{\bar Y}_{GR} = \hat{\bar Y} + (\bar X - \hat{\bar X})' \hat B. \qquad (6)$$

In addition to those stated in section 2, this estimator has the
following properties: (i) the means of the auxiliary variables
estimated through GREG equal the corresponding known
population means, i.e. $\hat{\bar X}_{GR} = \bar X$; (ii) the model expected
value of the asymptotic design variance, i.e. $E_m V(\tilde{\bar Y}_{GR})$,
is a minimum among all asymptotically design-unbiased
estimators of $\bar Y$ (Wright 1983). Consequently, if the model
is well specified, no other asymptotically unbiased
estimator exists that is on the average (with respect to the
model) more efficient than $\hat{\bar Y}_{GR}$.

Well known estimators currently used in practice, such 
as the ratio and post-stratified estimator, belong to the class 
of GREG estimators. Furthermore, such a class has recently 
been extended by means of the calibration technique 
(Deville and Särndal 1992) to better control the variability 
of the final observation weights. 
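
For a single auxiliary variable the GREG estimator reduces to a few weighted sums. The following minimal Python sketch covers only this scalar case (q = 1); the function name `greg_mean` and its argument names are illustrative assumptions. It also shows that, under simple random sampling with v_i = x_i, the estimator coincides with the ratio estimator of the mean.

```python
def greg_mean(y, x, pi, v, N, x_bar_pop):
    """GREG estimator of a population mean with one auxiliary variable.

    y, x      : sample values of the survey and auxiliary variables
    pi        : first order inclusion probabilities of the sampled units
    v         : known variance function values v_i of the working model
    N         : population size
    x_bar_pop : known population mean of x
    """
    # Horvitz-Thompson estimators of the two population means.
    ht_y = sum(yi / (N * p) for yi, p in zip(y, pi))
    ht_x = sum(xi / (N * p) for xi, p in zip(x, pi))
    # Design-weighted least squares slope (scalar version of the coefficient).
    num = sum(xi * yi / (vi * p) for xi, yi, vi, p in zip(x, y, v, pi))
    den = sum(xi * xi / (vi * p) for xi, vi, p in zip(x, v, pi))
    b_hat = num / den
    return ht_y + (x_bar_pop - ht_x) * b_hat


y = [2.0, 4.0, 5.0, 9.0]
x = [1.0, 2.0, 3.0, 4.0]
pi = [0.04] * 4            # simple random sample of n = 4 from N = 100
est = greg_mean(y, x, pi, v=x, N=100, x_bar_pop=3.0)
ratio_est = 3.0 * (sum(y) / 4) / (sum(x) / 4)
```

Here both `est` and `ratio_est` equal 6, up to floating point error.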


4. THE OPTIMAL ESTIMATOR 


For constructing an alternative regression estimator
based on the same auxiliary variable x, a second approach
considers the vector $\beta$ that minimizes the asymptotic
variance (3) of the difference estimator (2). Assuming
$V(\hat{\bar X})$ non singular, i.e. there are no linear combinations of
the entries of $\hat{\bar X}$ with a zero sampling variance, the
minimum variance vector is given by

$$\beta_o = [V(\hat{\bar X})]^{-1} C(\hat{\bar X}, \hat{\bar Y}).$$

Now, consider the unbiased estimators $\hat V(\hat{\bar X})$ and $\hat C(\hat{\bar X}, \hat{\bar Y})$
of $V(\hat{\bar X})$ and $C(\hat{\bar X}, \hat{\bar Y})$, respectively, that exist provided
that the second order inclusion probabilities of the sample
design are all positive. They are given by the Horvitz-
Thompson formula or the Yates-Grundy formula when
applicable. For example, using the former we have the
estimated covariance vector

$$\hat C(\hat{\bar X}, \hat{\bar Y}) = \sum_{i \in s} x_i Y_i \frac{1 - \pi_i}{N^2 \pi_i^2} + \sum_{i \in s} \sum_{j \ne i} x_i Y_j \frac{\pi_{ij} - \pi_i \pi_j}{N^2 \pi_i \pi_j \pi_{ij}}.$$

Using $\hat V(\hat{\bar X})$ and $\hat C(\hat{\bar X}, \hat{\bar Y})$ we get the alternative regression
estimator

$$\hat{\bar Y}_{OP} = \hat{\bar Y} + (\bar X - \hat{\bar X})' \hat\beta_o,$$

where $\hat\beta_o = [\hat V(\hat{\bar X})]^{-1} \hat C(\hat{\bar X}, \hat{\bar Y})$. It was studied by
Montanari (1987) and called by Rao (1994) the optimal
estimator. When $V(\hat{\bar X})$ is singular and its rank is $q' < q$, to
define the optimal estimator it is understood that one or
more entries of $x_i$, hence of $\hat{\bar X}$, have to be dropped in such
a way as to obtain a $q' \times q'$ non singular variance matrix.

Using the expression for $\beta_o$ the asymptotic variance of $\hat{\bar Y}_{OP}$
simplifies to

$$V(\tilde{\bar Y}_{OP}) = V(\hat{\bar Y}) - C(\hat{\bar X}, \hat{\bar Y})' [V(\hat{\bar X})]^{-1} C(\hat{\bar X}, \hat{\bar Y}). \qquad (7)$$

The properties of the optimal estimator are: (i) asymptotically,
the efficiency of $\hat{\bar Y}_{OP}$ is not inferior to that of $\hat{\bar Y}_{GR}$,
i.e. $V(\tilde{\bar Y}_{OP}) \le V(\tilde{\bar Y}_{GR})$; (ii) the means of the auxiliary
variables estimated through the optimal estimator equal the
corresponding known population means, i.e. $\hat{\bar X}_{OP} = \bar X$. As
for the case of the GREG estimator, when there is more
than one survey variable, the optimal estimator $\hat{\bar Y}_{OP}$ can be
expressed as a simple weighted estimator with the same
weights applying to all variables of interest. For example,
using the Horvitz-Thompson formula for variance and
covariance estimators, we can write $\hat{\bar Y}_{OP} = \sum_{i \in s} Y_i w_i$, where

$$w_i = \frac{1}{N\pi_i} + (\bar X - \hat{\bar X})' [\hat V(\hat{\bar X})]^{-1} \left[ x_i \frac{1 - \pi_i}{N^2 \pi_i^2} + \sum_{j \in s,\, j \ne i} x_j \frac{\pi_{ij} - \pi_i \pi_j}{N^2 \pi_i \pi_j \pi_{ij}} \right].$$

A similar result can be achieved with the Yates-Grundy
formula.

Note that the asymptotic optimality of $\hat{\bar Y}_{OP}$ is a strictly
design based property, achieved conditionally on the
realized finite population (hence, within the fixed population
approach to the finite population inference). On the
contrary, the asymptotic optimality of $\hat{\bar Y}_{GR}$ requires the
model to be true, and concerns the average asymptotic
variance over the finite populations that can be generated
under the model.

Because of these results, $\hat{\bar Y}_{OP}$ would seem preferable to $\hat{\bar Y}_{GR}$.
However, $\hat B$ is a function of population total estimators
and $\hat\beta_o$ is a function of variance and covariance estimators.
As a consequence, the former is more vulnerable to model
misspecification, and the latter is more vulnerable to
sampling fluctuations. In a finite size sample, $\hat{\bar Y}_{OP}$ is
generally less stable and more complex to compute and its
variance can be greater than that of $\hat{\bar Y}_{GR}$; see Casady and
Valliant (1993). However, if an adequate number, g, of
degrees of freedom are available for estimating $\beta_o$, the
instability problem of $\hat{\bar Y}_{OP}$ can be overcome. For example,
for standard complex sampling designs having with-replacement
sampling at the first stage, g can be roughly
taken as the number of sample clusters minus the number of
strata (Lehtonen and Pahkinen 1995; p. 181; see Eltinge
and Jang 1996, for more elaboration on this topic). A stable $\hat\beta_o$
can be expected when g is large enough relative to the
dimension q of the auxiliary variable $x_i$. Since with
modern computers the computation of $\hat{\bar Y}_{OP}$ is less
problematic, it becomes interesting to develop a criterion
for recognizing when such an estimator is truly
advantageous.


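5. A CRITERION FOR CHOOSING BETWEEN
$\hat{\bar Y}_{GR}$ AND $\hat{\bar Y}_{OP}$


Consider the following theorem:
Theorem: Let $V(\tilde{\bar Y}_R)$ and $V(\tilde{\bar Y}_{OP})$ be the asymptotic
variances of the general regression estimator $\hat{\bar Y}_R$ and the
optimal estimator $\hat{\bar Y}_{OP}$, respectively. Then

$$V(\tilde{\bar Y}_R) - V(\tilde{\bar Y}_{OP}) = C(\hat{\bar X}, \tilde{\bar Y}_R)' [V(\hat{\bar X})]^{-1} C(\hat{\bar X}, \tilde{\bar Y}_R). \qquad (8)$$

Proof: Using (3) and (7), the difference in variances is

$$V(\tilde{\bar Y}_R) - V(\tilde{\bar Y}_{OP}) = \beta' V(\hat{\bar X}) \beta - 2\beta' C(\hat{\bar X}, \hat{\bar Y}) + C(\hat{\bar X}, \hat{\bar Y})' [V(\hat{\bar X})]^{-1} C(\hat{\bar X}, \hat{\bar Y}).$$

Since $\beta_o = [V(\hat{\bar X})]^{-1} C(\hat{\bar X}, \hat{\bar Y})$ and $\beta' C(\hat{\bar X}, \hat{\bar Y}) = \beta' V(\hat{\bar X}) \beta_o$,
we have

$$V(\tilde{\bar Y}_R) - V(\tilde{\bar Y}_{OP}) = (\beta - \beta_o)' V(\hat{\bar X}) (\beta - \beta_o).$$

But $C(\hat{\bar X}, \tilde{\bar Y}_R) = C(\hat{\bar X}, \hat{\bar Y}) - V(\hat{\bar X})\beta = V(\hat{\bar X})(\beta_o - \beta)$ and (8)
follows.

Note that the right hand side of (8) is a positive definite
quadratic form and it is equal to zero if and only if
$C(\hat{\bar X}, \tilde{\bar Y}_R) = 0$. Therefore, the smaller the absolute values of
the entries of $C(\hat{\bar X}, \tilde{\bar Y}_R)$ are, the smaller the difference
$V(\tilde{\bar Y}_R) - V(\tilde{\bar Y}_{OP})$ is. The main conclusion the theorem provides
us is that an efficient use of any known auxiliary
variable population mean requires us to adopt estimators
that are uncorrelated with the auxiliary variable mean
estimator.

Applying the theorem to the GREG estimator, let us
consider the k-th entry of $C(\hat{\bar X}, \tilde{\bar Y}_{GR})$ that can be written

$$C(\hat{\bar X}_k, \tilde{\bar Y}_{GR}) = \sum_{i=1}^{N} x_{ik} U_i \frac{1 - \pi_i}{N^2 \pi_i} + \sum_{i=1}^{N} \sum_{j \ne i} x_{ik} U_j \frac{\pi_{ij} - \pi_i \pi_j}{N^2 \pi_i \pi_j},$$

where $U_i = Y_i - x_i' B$. If the superpopulation model (4) is
well specified, it follows that $E_m(U_i) = 0$, for all i, and
$E_m[C(\hat{\bar X}_k, \tilde{\bar Y}_{GR})] = 0$. Therefore, $C(\hat{\bar X}_k, \tilde{\bar Y}_{GR})$ must be
approximately zero for all k = 1, 2, ..., q, being proportional
to a weighted average of N uncorrelated random variables
with expected values zero. Consequently the difference
$V(\tilde{\bar Y}_{GR}) - V(\tilde{\bar Y}_{OP})$ must be negligible. The result suggests
using the more practical $\hat{\bar Y}_{GR}$. The conclusion is that the
estimator $\hat{\bar Y}_{OP}$ can achieve substantial gains in efficiency
compared to $\hat{\bar Y}_{GR}$ if the superpopulation model upon which
the latter is based is not good enough. This can happen
because the specification of the linear superpopulation
model is confined to regressors with a known
population mean.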
Since the following quantity

$$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) = \frac{C(\hat{\bar{X}}, \tilde{\bar{Y}}_{r1})' [V(\hat{\bar{X}})]^{-1} C(\hat{\bar{X}}, \tilde{\bar{Y}}_{r1})}{V(\tilde{\bar{Y}}_{r1})}$$

gives the asymptotic relative gain in efficiency that can be
achieved with $\hat{\bar{Y}}_{r2}$ compared to $\hat{\bar{Y}}_{r1}$, we propose it
as an indicator of a model inadequacy for extracting all
information from the sample. When $\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ is
greater than 10% or 15%, say, the optimal estimator should
be adopted. Provided that the second order inclusion
probabilities are all positive, under general conditions
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ can be consistently estimated from
sample data. Then, the information offered by the estimate
$\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ can be used for shifting from
$\hat{\bar{Y}}_{r1}$ to $\hat{\bar{Y}}_{r2}$ in the next repetition of a
periodic survey, or, as we suggest in section 6, within the
same survey, choosing between $\hat{\bar{Y}}_{r1}$ and $\hat{\bar{Y}}_{r2}$ at
the estimation stage.

This section concludes with a few examples.

Example 1. Consider a simple random sample of n units
and the linear regression model through the origin
$E_m(Y_i) = x_i \beta$, $V_m(Y_i) = \sigma^2 x_i$, $C_m(Y_i, Y_j) = 0$, $i \neq j$,
assuming $\bar{X}$ known. In this case the GREG is the ratio
estimator of the mean, i.e., $\hat{\bar{Y}}_{r1} = \bar{X} \bar{y} / \bar{x}$, where
$\bar{y}$ and $\bar{x}$ are the sample means of y and x, respectively.
The linear approximation is
$\tilde{\bar{Y}}_{r1} = \bar{X} R + \sum_{i \in s} U_i / n$, where $U_i = Y_i - R x_i$
and $R = \bar{Y} / \bar{X}$. Then, the covariance of $\bar{x}$ and $\tilde{\bar{Y}}_{r1}$ is

$$C(\bar{x}, \tilde{\bar{Y}}_{r1}) = \frac{N - n}{N n} S_x^2 \left( \frac{S_{xy}}{S_x^2} - R \right), \qquad (9)$$

where $S_{xy}$ is the population covariance between y and x and
$S_x^2$ is the population variance of x. If the model is well
specified, then $S_{xy} / S_x^2 \approx R$ and expression (9) must be
approximately zero. Otherwise, the greater the absolute
value of the intercept in a census linear regression of y on x,
the more efficient $\hat{\bar{Y}}_{r2}$ is asymptotically compared to
$\hat{\bar{Y}}_{r1}$. The result is not new (for example, see Cochran
1977; sec. 7.5), but it is achieved within a general framework.
Note that
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) = [S_{xy} / S_x^2 - R]^2 S_x^2 / S_U^2$,
where $S_U^2$, the population variance of $U_i$, is a constant
with respect to the sample size. When $\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$
is not negligible, $\hat{\bar{Y}}_{r2}$ should be chosen as regression
estimator, or, alternatively, an intercept plugged into the
model in order to use the corresponding GREG estimator.
For simple random sampling both solutions yield the same
estimator, but in general they are different, even for
self-weighting designs.
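
The expression for $\lambda$ in Example 1 can be evaluated directly on a population. The sketch below is illustrative only: the two synthetic populations, the seed and the function name `lambda_ratio` are assumptions made here, not part of the paper. One population follows the through-the-origin model, while the other has an intercept that the ratio estimator ignores.

```python
import random


def lambda_ratio(x, y):
    """Asymptotic relative gain lambda for the ratio estimator under
    simple random sampling: (S_xy/S_x^2 - R)^2 * S_x^2 / S_U^2."""
    n_pop = len(x)
    x_mean = sum(x) / n_pop
    y_mean = sum(y) / n_pop
    ratio = y_mean / x_mean
    s_xx = sum((a - x_mean) ** 2 for a in x) / (n_pop - 1)
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / (n_pop - 1)
    # The residuals U_i = Y_i - R x_i have population mean zero by construction.
    s_uu = sum((b - ratio * a) ** 2 for a, b in zip(x, y)) / (n_pop - 1)
    return (s_xy / s_xx - ratio) ** 2 * s_xx / s_uu


random.seed(1)
x = [random.uniform(10, 100) for _ in range(5000)]
noise = [random.gauss(0, 1) * a ** 0.5 for a in x]

y_origin = [2 * a + e for a, e in zip(x, noise)]          # model holds
y_intercept = [50 + 2 * a + e for a, e in zip(x, noise)]  # intercept ignored

lam_origin = lambda_ratio(x, y_origin)
lam_intercept = lambda_ratio(x, y_intercept)
print(round(lam_origin, 4), round(lam_intercept, 4))
```

Under these assumptions the first value is close to zero and the second is large, in line with the remark that the gain grows with the absolute value of the census intercept.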


Example 2. Consider a stratified random sample and the
linear homoscedastic regression model $E_m(Y_i) = \alpha + x_i \beta$,
$V_m(Y_i) = \sigma^2$, $C_m(Y_i, Y_j) = 0$, $i \neq j$. Assume that $\bar{X}$ is
known and that individual $x_i$'s are known only for sample
units and not for the nonsampled units. Now, the auxiliary
information is given by $x_i = (1, x_i)'$ and the corresponding
GREG estimator can be written
$\hat{\bar{Y}}_{r1} = \hat{\bar{Y}} + (\bar{X} - \hat{\bar{X}}) \hat{B}$, where

$$\hat{B} = \frac{\sum_{h} (N_h / n_h) \sum_{i \in s_h} (x_i - \hat{\bar{X}})(y_i - \hat{\bar{Y}})}{\sum_{h} (N_h / n_h) \sum_{i \in s_h} (x_i - \hat{\bar{X}})^2}$$

and where the estimated $\alpha$ cancels out. Because $B = S_{xy} / S_x^2$
and $U_i = Y_i - \bar{Y} - B (x_i - \bar{X})$, we have

$$C(\hat{\bar{X}}, \tilde{\bar{Y}}_{r1}) = \sum_{h=1}^{H} \frac{N_h (N_h - n_h)}{N^2 n_h} S_{xh}^2 (B_h - B), \qquad (10)$$

where the subindex h denotes stratum quantities and
$B_h = S_{xyh} / S_{xh}^2$. The right hand side of (10) is a function of
the differences between each within-stratum regression
coefficient and the coefficient for the whole population. If
the model is well specified, the differences $B_h - B$ must
be negligible. Otherwise, $C(\hat{\bar{X}}, \tilde{\bar{Y}}_{r1})$ can take non negligible
absolute values and, since only $\bar{X}$ is known, the estimator
$\hat{\bar{Y}}_{r2}$ appears to better extract all the information from
the known value of $\bar{X}$.

It is interesting to note that when the allocation of the
sample is proportional, i.e., $n_h \propto N_h$, ignoring terms of order
$1/n_h$ relative to unity, $\hat{\bar{Y}}_{r2}$ is equal to the GREG estimator
based on the auxiliary variable $x_i = (d_{1i}, d_{2i}, ..., d_{Hi}, x_i)'$
and $v_i = 1$, where $d_{hi}$ is an indicator variable of the
membership of unit i to stratum h = 1, 2, ..., H. This model
fits different regression lines with a common slope within
the strata.
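
Expression (10) of Example 2 is easy to evaluate for a stratified population. The following sketch uses two small artificial populations chosen here for illustration (the function name `cov_xhat_greg` and the data are assumptions): one in which every unit lies on a common line, so that $B_h = B$ in all strata, and one with the stratum slopes 1.26 and 0.82.

```python
def _mean(v):
    return sum(v) / len(v)


def _cov(a, b):
    """Covariance with divisor (len - 1)."""
    ma, mb = _mean(a), _mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def cov_xhat_greg(strata, n_h):
    """Evaluate the right hand side of (10).

    strata: list of (x_values, y_values) pairs, one per stratum (population).
    n_h:    list of stratum sample sizes.
    """
    n_pop = sum(len(x) for x, _ in strata)
    all_x = [v for x, _ in strata for v in x]
    all_y = [v for _, y in strata for v in y]
    b_pop = _cov(all_x, all_y) / _cov(all_x, all_x)   # B = S_xy / S_x^2
    total = 0.0
    for (x, y), n in zip(strata, n_h):
        big_n = len(x)
        s_xx_h = _cov(x, x)
        b_h = _cov(x, y) / s_xx_h                      # within-stratum slope
        total += big_n * (big_n - n) / (n_pop ** 2 * n) * s_xx_h * (b_h - b_pop)
    return total


x1 = list(range(30, 71))
x2 = list(range(50, 91))
common_line = [(x1, [3 + 2 * v for v in x1]), (x2, [3 + 2 * v for v in x2])]
two_slopes = [(x1, [1.26 * v for v in x1]), (x2, [0.82 * v for v in x2])]

c_common = cov_xhat_greg(common_line, [10, 10])
c_slopes = cov_xhat_greg(two_slopes, [10, 10])
print(c_common, c_slopes)
```

The covariance vanishes for the common-line population and is clearly positive when the within-stratum slopes differ from the overall slope.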

Example 3. Consider a complex sampling design and
suppose that the population can be partitioned into H
post-strata of known sizes. Assume the superpopulation model
$E_m(Y_i) = \beta_{h(i)}$, $V_m(Y_i) = \sigma^2$ and $C_m(Y_i, Y_j) = 0$, $i \neq j$, where
the subindex h(i) denotes the post-stratum to which the
i-th unit belongs. Denoting by $d_{hi}$ the indicator variable of
the i-th unit membership to post-stratum h, with $\bar{D}_h$ its
known population mean, and putting $x_i = (d_{1i}, d_{2i}, ..., d_{Hi})'$ and
$v_i = 1$ in (5), we get the post-stratified estimator
$\hat{\bar{Y}}_{r1} = \sum_{h=1}^{H} \bar{D}_h \hat{\bar{Z}}_h / \hat{\bar{D}}_h$, where $\hat{\bar{Z}}_h$ and $\hat{\bar{D}}_h$ are the
Horvitz-Thompson estimators of the means of $z_{hi} = y_i d_{hi}$ and
$d_{hi}$, respectively. The linear approximation is
$\tilde{\bar{Y}}_{r1} = \hat{\bar{Y}} + (\bar{X} - \hat{\bar{X}})' B$, where $B = (R_1, R_2, ..., R_H)'$,
$R_h = \bar{Z}_h / \bar{D}_h$ (i.e., the mean value of y in the h-th post-stratum),
and $\bar{X} = (\bar{D}_1, \bar{D}_2, ..., \bar{D}_H)'$. Since $U_i = Y_i - R_{h(i)}$, the
covariance of $\hat{\bar{D}}_h$ and $\tilde{\bar{Y}}_{r1}$ is

$$C(\tilde{\bar{Y}}_{r1}, \hat{\bar{D}}_h) = \sum_{h'=1}^{H} \left[ C(\hat{\bar{Z}}_{h'}, \hat{\bar{D}}_h) - R_{h'} C(\hat{\bar{D}}_{h'}, \hat{\bar{D}}_h) \right]. \qquad (11)$$

Under the superpopulation model upon which $\hat{\bar{Y}}_{r1}$ is
based, we have $E_m[C(\tilde{\bar{Y}}_{r1}, \hat{\bar{D}}_h)] = 0$ and a negligible
value of $C(\tilde{\bar{Y}}_{r1}, \hat{\bar{D}}_h)$ is expected for all h. It can be easily
seen that for simple random sampling, formula (11) is
identically zero. But in complex sampling schemes such
covariances might take non negligible values, for example,
when in a multistage sampling scheme a linear regression
of the primary unit totals of $z_{hi}$ on the totals of $d_{hi}$ yields a
non negligible intercept for some h. See Casady and
Valliant (1993) for a case study.


6. EMPIRICAL STUDIES 


The above analysis is based on first order approximations.
In the following empirical studies the finite sample
performances of $\hat{\bar{Y}}_{r1}$ and $\hat{\bar{Y}}_{r2}$ will be explored within the
framework of example 2.


6.1 The First Empirical Study 


In this first empirical study we consider a population of 
infinite size subdivided into two strata of equal weights and 
a proportional stratified random sampling design to estimate 
the mean of a survey variable y. To this end, let us suppose
that there exists a scalar variable x that was not available for
stratification but with a known population mean $\bar{X}$ and
unknown stratum means (i.e., the x values are not available
for nonsampled units).

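Since only the population mean of x is assumed known,
a reasonable superpopulation model that can be assumed to
identify a GREG estimator is the linear regression one, with
homoscedastic errors, i.e., $E_m(Y_i) = \alpha + \beta x_i$, $V_m(Y_i) = \sigma^2$,
$C_m(Y_i, Y_j) = 0$, $i \neq j$. The auxiliary variable plugged into (5)
is $x_i = (1, x_i)'$ and the corresponding GREG estimator can
be written

$$\hat{\bar{Y}}_{r1} = \bar{y} + (\bar{X} - \bar{x}) s_{xy} / s_x^2,$$

where $\bar{y}$ and $\bar{x}$ are the sample means of y and x, $s_{xy}$ is the
sample covariance between y and x, and $s_x^2$ is the sample
variance of x. The linear approximation is

$$\tilde{\bar{Y}}_{r1} = \bar{y} + (\bar{X} - \bar{x}) S_{xy} / S_x^2,$$

where $S_{xy}$ and $S_x^2$ are the population analogues of $s_{xy}$ and
$s_x^2$.

Dropping the first component of $x_i = (1, x_i)'$, whose
mean is estimated without error, the optimal estimator
based on the same auxiliary variable is given by

$$\hat{\bar{Y}}_{r2} = \bar{y} + (\bar{X} - \bar{x}) \hat{C}(\bar{y}, \bar{x}) / \hat{V}(\bar{x}),$$

where $\bar{X}$ is the population mean of x, and $\hat{C}(\bar{y}, \bar{x})$ and $\hat{V}(\bar{x})$
are the standard unbiased estimators of the covariance
between $\bar{y}$ and $\bar{x}$ and the variance of $\bar{x}$, respectively. The
corresponding linear approximation is

$$\tilde{\bar{Y}}_{r2} = \bar{y} + (\bar{X} - \bar{x}) C(\bar{y}, \bar{x}) / V(\bar{x}),$$

where $C(\bar{y}, \bar{x})$ and $V(\bar{x})$ are the true covariance and
variance.

In this case, with two strata of equal weights and proportional
allocation, it follows from (10) that the expression of
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ simplifies to

$$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) = \frac{\left[ \sum_{h=1}^{2} S_{xh}^2 (B_h - B) \right]^2}{\left( \sum_{h=1}^{2} S_{xh}^2 \right) \left( \sum_{h=1}^{2} S_{Uh}^2 \right)},$$

where $S_{Uh}^2$ is the variance of $U_i$ within stratum h,
and it can be estimated replacing the population variances
and covariances with the sample analogues.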

Four simulations were performed. In the first two, the 
sample values of x were drawn from a uniform distribution 
on [30-70] in the first stratum and [50-90] in the second 
one. The sample values of y, given x, were drawn from a 
normal distribution with expected values 1.26x in the first 
stratum and 0.82x in the second. The conditional variance 
was 8x in both strata in the first simulation and 3x in the 
second one. In the third and fourth simulation, the sample 
values of x were drawn from a linearly transformed gamma 
random variable with parameters chosen to achieve the first 
two simulation stratum means and variances for x and y and 
an asymmetry index for x (given by the ratio between the 
third central moment and the third power of the standard 
deviation) equal to 2.5. This allows studying the effects of 
a strong asymmetry in the marginal distributions of y and x. 

The populations were constructed to have
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) = 8.1\%$ when V(Y|x) = 8x, and
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) = 18.6\%$ when V(Y|x) = 3x. Note that the
GREG estimator based on the true model is the separate
ratio estimator; however, its use would require the
knowledge of the stratum means of x, but they are assumed
unknown.

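In each simulation we drew 10,000 samples of size 20
(ten units per stratum), and 5,000 of size 40 (twenty units
per stratum). For each sample we computed the values of
the Horvitz-Thompson estimator $\bar{y}$, and of $\hat{\bar{Y}}_{r1}$, $\hat{\bar{Y}}_{r2}$,
$\tilde{\bar{Y}}_{r1}$, $\tilde{\bar{Y}}_{r2}$, and $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$. We also computed an
estimator $\hat{\bar{Y}}_{r3}$, defined to take the value of $\hat{\bar{Y}}_{r1}$ if
$\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) < 8\%$, and the value of $\hat{\bar{Y}}_{r2}$ otherwise.
So, $\hat{\bar{Y}}_{r3}$ is a sample dependent type estimator constructed
choosing between $\hat{\bar{Y}}_{r1}$ and $\hat{\bar{Y}}_{r2}$ according to the
estimated value of $\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$. Here, 8% is an
arbitrarily chosen threshold, over which shifting from
$\hat{\bar{Y}}_{r1}$ to $\hat{\bar{Y}}_{r2}$ is thought to be convenient.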
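
The selection rule just described can be sketched in a few lines. The code below is a simplified illustration, not the simulation program used in the study: it assumes an infinite population (no finite population corrections), proportional allocation, a plug-in estimate of $\lambda$ built from sample variances and covariances, and hypothetical names such as `sample_dependent_estimate`.

```python
def _cov(a, b):
    """Sample covariance with divisor (n - 1)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (n - 1)


def sample_dependent_estimate(strata, weights, x_pop_mean, threshold=0.08):
    """Return (estimate, choice, lambda_hat) for a stratified sample.

    strata:  list of (x_sample, y_sample) pairs, one per stratum.
    weights: stratum weights W_h (proportional allocation is assumed).
    """
    x_bar = sum(w * sum(x) / len(x) for w, (x, _) in zip(weights, strata))
    y_bar = sum(w * sum(y) / len(y) for w, (_, y) in zip(weights, strata))

    # GREG slope: pooled sample covariance over pooled sample variance.
    xs = [v for x, _ in strata for v in x]
    ys = [v for _, y in strata for v in y]
    b_greg = _cov(xs, ys) / _cov(xs, xs)

    # Residuals of the GREG fit (additive constants do not affect covariances).
    us = [[yi - b_greg * xi for xi, yi in zip(x, y)] for x, y in strata]

    # Design-based variance and covariance estimates of the stratified means.
    v_x = sum(w * w * _cov(x, x) / len(x) for w, (x, _) in zip(weights, strata))
    c_xy = sum(w * w * _cov(x, y) / len(x) for w, (x, y) in zip(weights, strata))
    c_xu = sum(w * w * _cov(x, u) / len(x)
               for w, (x, _), u in zip(weights, strata, us))
    v_u = sum(w * w * _cov(u, u) / len(u) for w, u in zip(weights, us))

    lam_hat = c_xu ** 2 / (v_x * v_u)
    b_opt = c_xy / v_x

    if lam_hat < threshold:
        return y_bar + (x_pop_mean - x_bar) * b_greg, "greg", lam_hat
    return y_bar + (x_pop_mean - x_bar) * b_opt, "optimal", lam_hat


x1 = [30, 40, 50, 60, 70]
x2 = [50, 60, 70, 80, 90]
sample = [(x1, [1.26 * v for v in x1]), (x2, [0.82 * v for v in x2])]
estimate, choice, lam_hat = sample_dependent_estimate(sample, [0.5, 0.5], 62.0)
print(estimate, choice, lam_hat)
```

In this artificial sample the estimated $\lambda$ is far above the 8% threshold, so the rule selects the optimal estimator.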

Table 1 reports for each simulation the empirical results
achieved with reference to the percent relative bias of
estimators (RB) and the mean squared error (MSE), in the
latter case having set that of the Horvitz-Thompson
estimator equal to 100 by multiplying the MSE values by
100/MSE($\bar{y}$). As we can see, the biases are all negligible
(the biggest absolute value is less than 0.6% and all biases
are less than 10% of the corresponding standard errors) and
contribute to the MSE in a negligible manner. The MSE
reduction percentages that can be achieved shifting from
$\tilde{\bar{Y}}_{r1}$ to $\tilde{\bar{Y}}_{r2}$ are approximately equal to the values of
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ fixed in advance, i.e., 8.1% and 18.6%. The
effective MSE values of $\hat{\bar{Y}}_{r1}$ and $\hat{\bar{Y}}_{r2}$ are greater than the
corresponding asymptotic values, in particular when the
population is asymmetric and the estimator is the optimal
one. For example, in the third simulation, when n = 20, the
MSE of $\hat{\bar{Y}}_{r1}$ shows a 5.1% relative increase compared to
that of $\tilde{\bar{Y}}_{r1}$, while the corresponding value for $\hat{\bar{Y}}_{r2}$ is 10.7%.
Doubling the sample size, those relative values decrease to
2.8% and 3.6%, respectively. As we observed in example
2, when the sample allocation is proportional, $\hat{\bar{Y}}_{r2}$ is equal
to the GREG estimator based on a homoscedastic linear
model that fits two parallel regression lines in the two
strata. So, the greater percentage loss in efficiency of $\hat{\bar{Y}}_{r2}$
with respect to its asymptotic variance can be explained by
the added parameter to be estimated in the model.


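The performance of $\hat{\bar{Y}}_{r3}$ is also interesting; this estimator
is approximately unbiased and its MSE is lower than that of
$\hat{\bar{Y}}_{r1}$ the more often $\hat{\bar{Y}}_{r2}$ is selected. Table 1 reports for each
simulation the percentages of samples for which
$\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) > 8\%$ and $\hat{\bar{Y}}_{r2}$ was selected instead of
$\hat{\bar{Y}}_{r1}$: the higher the theoretical value of $\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$, the
more often $\hat{\bar{Y}}_{r2}$ is chosen over $\hat{\bar{Y}}_{r1}$.

Obviously the performance of $\hat{\bar{Y}}_{r3}$ depends on the
sampling distribution of the sample statistic $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$.
Table 2 reports the means, the standard deviations, and
some quantiles of the empirical distributions of $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$
for the gamma populations, which are the more problematic
ones. As can be seen, the distributions of $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ were
in all cases positively skewed and highly variable. This
means that larger sample sizes than those considered here
are needed to get reliable estimates of $\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$. Clearly,
the smaller the variance of $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$, the higher the gain in
efficiency of $\hat{\bar{Y}}_{r3}$ over $\hat{\bar{Y}}_{r1}$ when the true value of
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ is over the threshold for $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ chosen
to shift from $\hat{\bar{Y}}_{r1}$ to $\hat{\bar{Y}}_{r2}$.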

Table 1
Empirical percent relative bias (RB) and Mean Squared Error (MSE) of $\bar{y}$, $\tilde{\bar{Y}}_{r1}$, $\tilde{\bar{Y}}_{r2}$, $\hat{\bar{Y}}_{r1}$, $\hat{\bar{Y}}_{r2}$ and $\hat{\bar{Y}}_{r3}$,
and percentage of samples for which $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2}) > 8\%$ in the first empirical study

Uniform populations

                                  V(Y|x) = 8x                       V(Y|x) = 3x
                             n = 20          n = 40            n = 20          n = 40
Estimator                  RB (%)   MSE    RB (%)   MSE      RB (%)   MSE    RB (%)   MSE
$\bar{y}$                   -0.06  100.0    -0.08  100.0       0.12  100.0    -0.10  100.0
$\tilde{\bar{Y}}_{r1}$      -0.05   83.8    -0.06   84.1       0.10   69.4    -0.05   68.8
$\tilde{\bar{Y}}_{r2}$      -0.03   77.3    -0.04   77.7       0.07   56.2     0.01   55.8
$\hat{\bar{Y}}_{r1}$         0.07   87.7    -0.01   86.2       0.22   73.4    -0.00   70.5
$\hat{\bar{Y}}_{r2}$        -0.05   82.4    -0.04   80.1       0.05   59.8    -0.00   57.3
$\hat{\bar{Y}}_{r3}$        -0.06   85.0    -0.05   83.1       0.03   61.0    -0.01   57.9

Freq ($\hat{\lambda}$ > 8%)      53.5%           53.6%             88.6%           93.5%

Gamma populations

                                  V(Y|x) = 8x                       V(Y|x) = 3x
                             n = 20          n = 40            n = 20          n = 40
Estimator                  RB (%)   MSE    RB (%)   MSE      RB (%)   MSE    RB (%)   MSE
$\bar{y}$                    0.07  100.0    -0.01  100.0       0.02  100.0    -0.03  100.0
$\tilde{\bar{Y}}_{r1}$       0.08   84.1     0.02   84.3       0.06   69.8    -0.03   69.9
$\tilde{\bar{Y}}_{r2}$       0.09   77.5     0.05   78.1       0.10   57.1    -0.02   56.9
$\hat{\bar{Y}}_{r1}$        -0.58   88.4    -0.30   86.7      -0.60   75.5    -0.36   72.8
$\hat{\bar{Y}}_{r2}$         0.03   85.8     0.03   80.9       0.12   63.5    -0.02   59.1
$\hat{\bar{Y}}_{r3}$        -0.05   87.9     0.07   86.2       0.06   65.4    -0.04   60.8

Freq ($\hat{\lambda}$ > 8%)      50.6%           50.3%             86.9%           91.7%


Table 2

Selected characteristics of the empirical distributions of
$\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ (in %) for gamma populations (first empirical study)

Gamma populations          Mean   Standard     Median        Quantiles
                                  deviation                  10%            90%
V(Y|x) = 8x, n = 20        10.7      9.8        8.7          1.3            24.9
V(Y|x) = 8x, n = 40         9.2      6.3        8.3          [illegible]    [illegible]
V(Y|x) = 3x, n = 20        21.6     12.3        [illegible]  [illegible]    [illegible]
V(Y|x) = 3x, n = 40        19.0      9.5       13.0          9.4            [illegible]


6.2 The Second Empirical Study 


In the second empirical study, we consider a finite
population subdivided into eight strata each of size 100,
according to an auxiliary variable x whose values are
assumed known for each unit of the population. In order to
simulate a stratification based on x, the values of x were
assigned through a monotonic function of h and i,

$$x_{hi} = 4.95 + 5 \sum_{j=1}^{h-1} j + [\text{term in } h \text{ and } i \text{, illegible in the source}],$$

where i is the label of the unit, i = 1, 2, ..., 100, within the
stratum h = 1, 2, ..., 8.

A finite population of y values, given x, was generated
using the model

$$Y_{hi} = 20 + 2 x_{hi} + 0.06 x_{hi}^2 + [\text{factor illegible in the source}] \, \varepsilon_{hi},$$

where $\varepsilon_{hi}$ is a standard normal random variable. The
realized values of the mean, standard deviation and
asymmetry index of y are 618.2, 676.0, and 1.21,
respectively. The correlation between y and x is 0.96.

A proportional stratified random sampling without
replacement design was used to select 5,000 samples of size
n = 40 (five units per stratum) and 2,500 samples of size 80
(ten units per stratum). For each sample we computed the
following quantities:

- the unbiased estimator of the population mean, $\bar{y}$;

- the ratio estimator $\hat{\bar{Y}}_{r11}$, based on the model $E_m(Y_{hi}) = \beta x_{hi}$
  and $V_m(Y_{hi}) = \sigma^2 x_{hi}$, and obtained from (5) and (6)
  putting $x_{hi} = x_{hi}$ and $v_{hi} = x_{hi}$;

- the optimal estimator $\hat{\bar{Y}}_{r21}$, based on the same auxiliary
  variable used for $\hat{\bar{Y}}_{r11}$;

- the GREG estimator $\hat{\bar{Y}}_{r12}$, based on the model
  $E_m(Y_{hi}) = \alpha + \beta x_{hi}$ and $V_m(Y_{hi}) = \sigma^2 x_{hi}$, and obtained
  from (5) and (6) putting $x_{hi} = (1, x_{hi})'$ and $v_{hi} = x_{hi}$;

- the optimal estimator $\hat{\bar{Y}}_{r22}$, based on the same auxiliary
  variables used for $\hat{\bar{Y}}_{r12}$;

- the GREG estimator $\hat{\bar{Y}}_{r13}$, based on the model
  $E_m(Y_{hi}) = \alpha + \beta x_{hi} + \gamma x_{hi}^2$ and $V_m(Y_{hi}) = \sigma^2 x_{hi}$ (the true
  model), and obtained from (5) and (6) putting
  $x_{hi} = (1, x_{hi}, x_{hi}^2)'$ and $v_{hi} = x_{hi}$;

- the optimal estimator $\hat{\bar{Y}}_{r23}$, based on the same auxiliary
  variables used for $\hat{\bar{Y}}_{r13}$;

- the linear approximations $\tilde{\bar{Y}}_{r12}$, $\tilde{\bar{Y}}_{r22}$, $\tilde{\bar{Y}}_{r13}$ and $\tilde{\bar{Y}}_{r23}$ of
  $\hat{\bar{Y}}_{r12}$, $\hat{\bar{Y}}_{r22}$, $\hat{\bar{Y}}_{r13}$ and $\hat{\bar{Y}}_{r23}$, respectively;

- the statistics $\hat{\lambda}(\hat{\bar{Y}}_{r1k}, \hat{\bar{Y}}_{r2k})$, for k = 1, 2, 3;

- the sample dependent estimators $\hat{\bar{Y}}_{r3k}$ (k = 1, 2, 3),
  defined to take the value of $\hat{\bar{Y}}_{r1k}$ when
  $\hat{\lambda}(\hat{\bar{Y}}_{r1k}, \hat{\bar{Y}}_{r2k}) < 8\%$, and the value of $\hat{\bar{Y}}_{r2k}$ otherwise.

We do not consider separate regression estimation
because sample sizes within strata are small. The finite
population is such that $\lambda(\hat{\bar{Y}}_{r11}, \hat{\bar{Y}}_{r21}) = 0.22$,
$\lambda(\hat{\bar{Y}}_{r12}, \hat{\bar{Y}}_{r22}) = 0.16$, and $\lambda(\hat{\bar{Y}}_{r13}, \hat{\bar{Y}}_{r23}) = 0.00$. Note that
because of the sample design considered we have
$\hat{\bar{Y}}_{r21} = \hat{\bar{Y}}_{r22}$ and therefore we omit $\hat{\bar{Y}}_{r21}$.

Table 3 reports the empirical results achieved with
reference to the percent relative bias of estimators (RB) and
the Mean Squared Error (MSE), in the latter case having set
that of the Horvitz-Thompson estimator equal to 100. The
results are separated according to the sample size.

Again, the biases are all negligible. The MSE reduction
percentage that can be achieved with respect to the sample
mean increases with the number of auxiliary variables used.
However, as expected, $\hat{\bar{Y}}_{r11}$ and $\hat{\bar{Y}}_{r12}$ are less efficient than
the optimal estimator $\hat{\bar{Y}}_{r22}$ based on the same auxiliary
variables. The statistics $\hat{\lambda}(\hat{\bar{Y}}_{r11}, \hat{\bar{Y}}_{r21})$ and $\hat{\lambda}(\hat{\bar{Y}}_{r12}, \hat{\bar{Y}}_{r22})$
take values above the 8% threshold most of the time,
especially when the sample size is 80. The sample
dependent estimators $\hat{\bar{Y}}_{r31}$ and $\hat{\bar{Y}}_{r32}$ are both more efficient
than $\hat{\bar{Y}}_{r11}$ and $\hat{\bar{Y}}_{r12}$. The result is due to the inadequacy of
the models upon which $\hat{\bar{Y}}_{r11}$ and $\hat{\bar{Y}}_{r12}$ are based for
extracting all information from the sample. On the other
hand, $\hat{\bar{Y}}_{r13}$ is more efficient than $\hat{\bar{Y}}_{r23}$, because it is based on
the true model. Most of the time the statistic $\hat{\lambda}(\hat{\bar{Y}}_{r13}, \hat{\bar{Y}}_{r23})$
is below the threshold, especially when the sample size is
80, and the sample dependent estimator $\hat{\bar{Y}}_{r33}$ is almost as
efficient as $\hat{\bar{Y}}_{r13}$.

Looking at the linear approximations, first we observe
that the MSE's of the GREG estimators $\hat{\bar{Y}}_{r12}$ and $\hat{\bar{Y}}_{r13}$ are
almost equal to those of $\tilde{\bar{Y}}_{r12}$ and $\tilde{\bar{Y}}_{r13}$ in this second study.
This is not true for the optimal estimators $\hat{\bar{Y}}_{r22}$ and $\hat{\bar{Y}}_{r23}$.
The losses in efficiency with respect to their linear
approximations $\tilde{\bar{Y}}_{r22}$ and $\tilde{\bar{Y}}_{r23}$ are greater, but they diminish
rapidly when the sample size increases. The MSE's of the
linear approximations confirm that, given a certain amount
of auxiliary information, a negligible gain in efficiency can
be achieved through the optimal estimator, even with very
large samples (compare $\tilde{\bar{Y}}_{r13}$ with $\tilde{\bar{Y}}_{r23}$), when the model
upon which the GREG is based holds true. Substantial
gains in efficiency can be achieved if the model is not
adequate, such as those upon which $\hat{\bar{Y}}_{r11}$ and $\hat{\bar{Y}}_{r12}$ are
based (compare $\tilde{\bar{Y}}_{r12}$ with $\tilde{\bar{Y}}_{r22}$). Table 4 reports the means,
standard deviations and some quantiles of the empirical
distributions of $\hat{\lambda}(\hat{\bar{Y}}_{r1k}, \hat{\bar{Y}}_{r2k})$, k = 1, 2, 3.


Table 3
Empirical percent relative bias (RB) and Mean Squared Error (MSE) of estimators and percentage of samples for which
$\hat{\lambda}(\hat{\bar{Y}}_{r1k}, \hat{\bar{Y}}_{r2k}) > 8\%$ in the second empirical study

Auxiliary                                     Sample size 40                      Sample size 80
variables used  Estimator                RB (%)   MSE          Freq            RB (%)   MSE    Freq
                                                               ($\hat{\lambda}$ > 8%)                  ($\hat{\lambda}$ > 8%)
none            $\bar{y}$                  0.01   100.0        -                 0.01   100.0   -
(x)             $\hat{\bar{Y}}_{r11}$     -0.01    55.2        82.6%             0.00    54.3   85.0%
(x)             $\hat{\bar{Y}}_{r31}$     -0.05    48.4        -                -0.02    43.8   -
(1, x)'         $\hat{\bar{Y}}_{r12}$     -0.01   [illegible]  72.1%             0.00    50.8   83.2%
(1, x)'         $\hat{\bar{Y}}_{r22}$     -0.05    47.4        -                -0.01    43.3   -
(1, x)'         $\hat{\bar{Y}}_{r32}$     -0.05    48.3        -                -0.02    43.8   -
(1, x)'         $\tilde{\bar{Y}}_{r12}$    0.02    51.6        -                 0.01    50.7   -
(1, x)'         $\tilde{\bar{Y}}_{r22}$    0.02    44.3        -                 0.00    42.3   -
(1, x, x^2)'    $\hat{\bar{Y}}_{r13}$     -0.01    35.1        28.9%             0.02    33.5   10.5%
(1, x, x^2)'    $\hat{\bar{Y}}_{r23}$     -0.10    38.0        -                -0.03    34.7   -
(1, x, x^2)'    $\hat{\bar{Y}}_{r33}$     -0.04    37.0        -                -0.01    33.8   -
(1, x, x^2)'    $\tilde{\bar{Y}}_{r13}$    0.01    34.9        -                 0.03    33.5   -
(1, x, x^2)'    $\tilde{\bar{Y}}_{r23}$    0.01    34.7        -                 0.03    33.2   -


Table 4
Selected characteristics of the empirical distributions of $\hat{\lambda}(\hat{\bar{Y}}_{r1k}, \hat{\bar{Y}}_{r2k})$, k = 1, 2, 3 (second empirical study)

                        Sample size 40                                Sample size 80
k     Mean         Std. dev.  Median  10%    90%        Mean   Std. dev.  Median  10%    90%
1     0.24         0.15       0.23    0.04   0.45       0.23   0.10       0.23    0.07   0.35
2     [illegible]  0.14       0.17    0.02   0.38       0.18   0.09       0.17    0.04   0.30
3     [illegible]  0.08       0.03    0.00   0.18       0.03   0.04       0.01    0.00   0.08


7. DISCUSSION 


The optimal estimator can be an efficient alternative to
the generalized regression estimator based on misspecified
superpopulation models when the sample size is large
enough. This efficiency can be measured by means of the
sample statistic $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$, which captures the asymptotic
relative gain in efficiency of $\hat{\bar{Y}}_{r2}$ over $\hat{\bar{Y}}_{r1}$, given a certain
amount of auxiliary information. The performance of the
optimal estimator appears to be good, even in finite size
samples, and its use profitable, provided that the value of
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ is big enough to compensate for its greater
instability. In fact, the empirical results confirm a greater
instability in the optimal estimator, especially with
asymmetric populations. Further empirical evidence is
needed to evaluate its stability when the auxiliary variable
is multivariate and to establish when a sample is large
enough to overcome the problem.

In order to use the information provided by $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$
within the same survey, the distributional properties of this
sample statistic and of the sample dependent regression
estimator, which seems to perform well in the empirical
study, have to be studied in more detail. In particular, the
distribution of $\hat{\lambda}(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$ when its true value is zero will
be useful for choosing the threshold over which shifting
from $\hat{\bar{Y}}_{r1}$ to $\hat{\bar{Y}}_{r2}$ is truly profitable. Besides working with
larger sample sizes, the instability problem of this statistic
can be addressed by looking for more stable, consistent
estimators of the variances and covariances appearing in
$\lambda(\hat{\bar{Y}}_{r1}, \hat{\bar{Y}}_{r2})$. Furthermore, since in most practical situations
there is more than one variable of interest, in order to apply
the same weights to all variables, the optimal estimator
should be chosen on the grounds of an averaged $\lambda$-measure
across the main survey variables; such an average is
more stable than single $\lambda$-measures.


Combining Multiple Frames to Estimate Population Size
and Totals


DAWN E. HAINES and KENNETH H. POLLOCK' 


ABSTRACT 


Efficient estimates of population size and totals based on information from multiple list frames and an independent area 
frame are considered. This work is an extension of the methodology proposed by Hartley (1962) which considers two 
general frames. A main disadvantage of list frames is that they are typically incomplete. In this paper, we propose several 
methods to address frame deficiencies. A joint list-area sampling design incorporates multiple frames and achieves full 
coverage of the target population. For each combination of frames, we present the appropriate notation, likelihood function, 
and parameter estimators. Results from a simulation study that compares the various properties of the proposed estimators
are also presented.

KEY WORDS: Incomplete frame; Capture-recapture sampling; Screening estimator; Dual frame methodology; Multiple
frame estimation.


1. INTRODUCTION 


In classical sampling theory, it is assumed that a complete 
frame exists. In practice, however, this assumption is often 
violated. Frame imperfections such as omissions, duplica- 
tions, and inaccurate recordings are almost inevitable in any 
large data collection operation (Hansen, Hurwitz and 
Madow 1953). Information collected from list and area 
frames is used to obtain estimates of the unknown popula- 
tion size and totals. For example, an ecologist or wildlife 
biologist may use one list and one area frame sample to 
estimate the number of bald eagle nests in a given region. 
The U.S. Bureau of the Census uses dual system estimation 
to measure decennial census undercounts. Darroch, 
Fienberg, Glonek and Junker (1993) describe a three- 
sample multiple-capture approach to estimating population 
size when inclusion probabilities are heterogeneous. In 
addition, state agriculture officials may be interested in 
estimating the number of hog farms and the total number of 
hogs in North Carolina. Typically, information from 
multiple information sources is combined to estimate 
population sizes and totals. 

List frames are physical listings of sampling units in the 
target population. These are constructed over the years 
using information from scientists as well as city, county, 
state, and federal agencies. Items found on a list frame can 
include, but are not limited to, names, addresses, telephone 
numbers, social security numbers, or physical descriptions 
of location. These and other miscellaneous stratification 
variables are used to identify persons, animals, businesses, 
or other establishments. When estimating the number of 
bald eagle nests in a region, we construct this year’s list 
frame using information from last year’s list frame. With
the addition of new eagle nests, last year’s list frame
becomes quickly outdated and incomplete. Because of this 
incompleteness, estimates based solely on list frames typi- 
cally underestimate the true population size. Supplemen- 
ting available information with an area frame sample may 
provide an efficient estimation of the population size and 
totals. 

An area frame is a collection of geographical areas 
defined by identifiable boundaries. The entire area in 
which data are collected is divided into mutually exclusive 
and exhaustive sampling units called segments. The 
segments are usually stratified according to a characteristic 
of interest. Once a stratified random sample of segments is 
drawn, enumerators visit the sampled segments and record 
measurements on all reporting units contained therein. 

The National Agricultural Statistics Service (NASS) 
currently employs a multi-frame approach for its sampling 
and estimation of numerous agricultural commodities. 
Fecso, Tortora and Vogel (1986) provide a review of 
sampling frames for the agricultural sector of the United 
States while Nealon (1984) details the multiple and area 
frame estimators used by the U.S. Department of 
Agriculture. Kott and Vogel (1995) provide a general 
overview of multiple frame surveys. 

In Section 2, we consider estimation based on infor- 
mation from two or more independent list frames. We 
show how these methods are related to capture-recapture 
methods. In Section 3, we consider more efficient estima- 
tors of population size and totals when information from an 
independent area frame sample is available. We extend 
these methods to the case of dependent list frames in 
Section 4. Results from a simulation study that compare
different estimators are summarized in Section 5. Finally,
Section 6 summarizes our results and discusses future
directions for research.


' Dawn E. Haines, U.S. Bureau of the Census, Washington, DC 20233; Kenneth H. Pollock, North Carolina State University, Department of Statistics, Box


2. MULTIPLE LIST FRAMES 


2.1 Population Size Estimation 


List frames used to estimate population size are usually 
incomplete and do not cover the entire population. One 
solution to the incomplete list frame problem is to merge 
two or more incomplete list frames. Combining multiple 
list frames may result in improved coverage of the target 
population, and thus, may provide better estimators. In the 
case of multiple list frames, it is commonly assumed that 
each element in the population has the same probability of 
being included on a given list frame. Hence, the list frame 
elements themselves constitute our “samples.” For 
example, individuals may decide independently whether or 
not to list their telephone numbers in the telephone 
directory with equal probability. In the case of bald eagle 
nests, this year’s list frame is constructed based on last 
year’s nest sightings. If we assume that the probability of 
a nest being sighted is the same for all nests, then the above 
assumption is valid. Finally, the assumption is also valid in 
capture-recapture experiments where the first list frame 
consists of all animals captured on the first sampling 
occasion and the second list frame consists of all animals 
captured on the second sampling occasion. This scenario
corresponds to Model $M_t$ in the capture-recapture literature.
See Otis, Burnham, White and Anderson (1978) for details.
Model $M_t$ assumes all animals in the population are equally
at risk of capture on each sampling occasion, but this
probability can vary over different sampling occasions.

To begin, we consider the case of two independent list
frames, $B_1$ and $B_2$. Suppose $B_1$ has size $N_{B_1}$ and $B_2$ has
size $N_{B_2}$. Let domain $b_1$ ($b_2$) consist of those $N_{b_1}$ ($N_{b_2}$)
elements that belong only to frame $B_1$ ($B_2$) and let domain
$b_1 b_2$ contain the $N_{b_1 b_2}$ units that belong to both frames. The
“null” domain includes existing target population elements
that are not included on either list frame. Its size is
$N - N_{b_1} - N_{b_2} - N_{b_1 b_2}$. Domain notation for list frames $B_1$
and $B_2$ is summarized in Table 1. Note that every element in
every frame must be categorized into a domain without
error. Errors in domain determination are serious and
cannot be corrected at a later time. These errors are not
considered in the estimation phase and thus are regarded as
nonsampling errors. Nealon (1984) claims that domain
determination is the single largest source of nonsampling
error in multiple frame designs (Kott and Vogel 1995).

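Let the probability that a population element is included
on frame $B_1$ ($B_2$) be $p_{B_1}$ ($p_{B_2}$). Since list frames $B_1$ and $B_2$
are assumed to be independent, the probability of an
element belonging to domain $b_1$ is $p_{b_1} = p_{B_1}(1 - p_{B_2})$. The
remaining domain probabilities are defined similarly. The
population size N and the inclusion probabilities $p_{B_1}$ and
$p_{B_2}$ are unknown parameters. The likelihood function is
given by

$$L(p_{B_1}, p_{B_2}, N \mid N_{b_1}, N_{b_2}, N_{b_1 b_2}) = \frac{N!}{N_{b_1}! \, N_{b_2}! \, N_{b_1 b_2}! \, (N - N_{b_1} - N_{b_2} - N_{b_1 b_2})!} \, p_{b_1}^{N_{b_1}} \, p_{b_2}^{N_{b_2}} \, p_{b_1 b_2}^{N_{b_1 b_2}} \, [(1 - p_{B_1})(1 - p_{B_2})]^{N - N_{b_1} - N_{b_2} - N_{b_1 b_2}}. \qquad (1)$$


Table 1
Domain Notation for List Frames $B_1$ and $B_2$

Domain      Size                                      Domain Probability
$b_1$       $N_{b_1}$                                 $p_{b_1} = p_{B_1}(1 - p_{B_2})$
$b_2$       $N_{b_2}$                                 $p_{b_2} = (1 - p_{B_1}) p_{B_2}$
$b_1 b_2$   $N_{b_1 b_2}$                             $p_{b_1 b_2} = p_{B_1} p_{B_2}$
null        $N - N_{b_1} - N_{b_2} - N_{b_1 b_2}$     $1 - p_{b_1} - p_{b_2} - p_{b_1 b_2} = (1 - p_{B_1})(1 - p_{B_2})$
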
Maximum likelihood estimators (MLEs) of the frame
inclusion probabilities are obtained by maximizing the
logarithm of the likelihood (1). This procedure yields

$$\hat{p}_{B_1} = \frac{N_{B_1}}{\hat{N}} \quad \text{and} \quad \hat{p}_{B_2} = \frac{N_{B_2}}{\hat{N}}, \qquad (2)$$

where the MLE $\hat{N}$ is substituted for N. Rather than
differentiating the log-likelihood function to approximate
the value of N, we employ the “ratio method” of
maximizing the likelihood, which equates $L(N)$ to
$L(N - 1)$ (Darroch 1958). This process accounts for the
discrete parameter N and yields the equation

$$\frac{L(N)}{L(N - 1)} = \frac{N}{N - N_{b_1} - N_{b_2} - N_{b_1 b_2}} (1 - \hat{p}_{B_1})(1 - \hat{p}_{B_2}) = 1. \qquad (3)$$

Here we assume that N is large so that

$$\frac{N_{B_1}}{N - 1} \approx \frac{N_{B_1}}{N} \quad \text{and} \quad \frac{N_{B_2}}{N - 1} \approx \frac{N_{B_2}}{N}.$$

Substituting the estimators in (2) into (3) yields

$$\hat{N} = \frac{N_{B_1} N_{B_2}}{N_{b_1 b_2}}. \qquad (4)$$

Sekar and Deming (1949) derive an estimate of the variance
of (4), given by

$$\hat{V}(\hat{N}) = \frac{N_{B_1} N_{B_2} N_{b_1} N_{b_2}}{N_{b_1 b_2}^3}.$$

Substituting (4) into (2) yields the MLEs of $p_{B_1}$ and $p_{B_2}$.


The estimator $\hat{N}$ of N in (4) is called the Lincoln-
Petersen estimator in closed population capture-recapture
models. The elements on list frame $B_1$ may be considered
as the units captured in the first sampling occasion and the
elements on list frame $B_2$ may be viewed as the units
captured in the second sampling occasion. The elements in
domain $b_1 b_2$ correspond to recaptured elements. With this
correspondence, it is easy to see that the likelihood for the
population size and capture probabilities for two occasions
will be the same as that given in (1). Hence, the MLEs
derived for two independent list frames will be the same as
the corresponding MLEs for the capture-recapture model
with two sampling occasions.
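
As an illustration of (2) and (4), the following sketch computes the Lincoln-Petersen estimate, the Sekar and Deming variance estimate and the estimated inclusion probabilities from two lists of unit identifiers. The function name `lincoln_petersen` and the example frames are assumptions made here for illustration; identifiers are taken to be matched without error, in line with the domain determination requirement stated above.

```python
def lincoln_petersen(frame1, frame2):
    """Two-frame estimates of N, its variance and the inclusion probabilities.

    frame1, frame2: iterables of unit identifiers on list frames B1 and B2.
    """
    s1, s2 = set(frame1), set(frame2)
    n_both = len(s1 & s2)            # N_{b1 b2}: units on both frames
    if n_both == 0:
        raise ValueError("no overlap between frames; N is not estimable")
    n_only1 = len(s1 - s2)           # N_{b1}: units only on B1
    n_only2 = len(s2 - s1)           # N_{b2}: units only on B2

    n_hat = len(s1) * len(s2) / n_both                            # equation (4)
    var_hat = len(s1) * len(s2) * n_only1 * n_only2 / n_both ** 3
    p1_hat = len(s1) / n_hat                                      # equation (2)
    p2_hat = len(s2) / n_hat
    return n_hat, var_hat, p1_hat, p2_hat


# Hypothetical frames: units 0-59 on B1, units 40-119 on B2 (20 in common).
n_hat, var_hat, p1_hat, p2_hat = lincoln_petersen(range(0, 60), range(40, 120))
print(n_hat, var_hat, p1_hat, p2_hat)
```

With these artificial frames the 120 distinct listed units lead to an estimated population size of 240, so half of the population is estimated to be missing from both lists.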

Extending these ideas, we contend that combining $k$
independent list frames is directly related to having $k$
sampling occasions under Model $M_t$ in closed population
capture-recapture models, where $t = k$ (Otis et al. 1978).
The general likelihood function for $k$ independent list
frames, $B_1, B_2, \ldots, B_k$, has the form

$$L(p_{B_1}, \ldots, p_{B_k}, N \mid N_{b_1}, \ldots, N_{b_1 \ldots b_k}) = \binom{N}{N_{b_1}, \ldots, N_{b_1 \ldots b_k}} \prod_{i=1}^{k} p_{B_i}^{N_{B_i}} (1 - p_{B_i})^{N - N_{B_i}}, \qquad (5)$$

which has exactly the same structure as the likelihood
introduced by Darroch (1958) and is discussed in great
detail by Otis et al. (1978) and Seber (1982). The form of
the estimated frame inclusion probabilities is

$$\hat{p}_{B_i} = \frac{N_{B_i}}{\hat{N}}, \quad i = 1, \ldots, k. \qquad (6)$$

Values of $\hat{N}$ are obtained by numerically solving the
$(k - 1)$ degree polynomial in $N$ resulting from the equality

$$\frac{L(N)}{L(N-1)} = \frac{N}{N - N_{b_1} - \cdots - N_{b_1 \ldots b_k}}\,(1 - p_{B_1}) \cdots (1 - p_{B_k}) = 1. \qquad (7)$$

We then select as $\hat{N}$ the root that maximizes the value of
the likelihood function (5). Substituting this root into (6)
yields MLEs of the $k$ frame inclusion probabilities.
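
The numerical step can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it substitutes $p_{B_i} = N_{B_i}/N$ into (7) and uses bisection to find a root above the number of distinct listed elements, assuming a single sign change in that range. The function names and the example counts are hypothetical.

```python
def ratio_equation(n, frame_sizes, n_distinct):
    """N * prod(1 - N_Bi / N) - (N - D), i.e. equation (7) rearranged to zero.

    frame_sizes: the list frame sizes N_B1, ..., N_Bk.
    n_distinct: D, the number of distinct elements over all frames
                (the sum of all domain sizes).
    """
    prod = 1.0
    for size in frame_sizes:
        prod *= 1.0 - size / n
    return n * prod - (n - n_distinct)


def solve_population_size(frame_sizes, n_distinct, tol=1e-10):
    """Bisection for a root of ratio_equation on [D, infinity)."""
    if sum(frame_sizes) <= n_distinct:
        raise ValueError("frames do not overlap, so N cannot be estimated")
    lo = float(n_distinct)          # the function is non-negative here
    hi = 2.0 * lo
    while ratio_equation(hi, frame_sizes, n_distinct) > 0:
        hi *= 2.0                   # expand until the sign changes
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if ratio_equation(mid, frame_sizes, n_distinct) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Two frames: the root reproduces the Lincoln-Petersen value 60 * 50 / 30.
two_frame = solve_population_size([60, 50], 80)
# Three frames (hypothetical counts): check the residual instead.
three_frame = solve_population_size([60, 50, 40], 90)
```

When several admissible roots exist, the text's rule of choosing the root with the largest likelihood (5) would still have to be applied; the sketch does not do this.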


2.2 Population Total Estimation 


Suppose the measured $y_i$ values are available for all
units on the $k$ independent list frames. The estimated
probability that the $i$-th element is included on at least one
of the $k$ list frames is

$$\hat{\pi}_i = P\Big[i \in \bigcup_{j=1}^{k} B_j\Big] = 1 - (1 - \hat{p}_{B_1})(1 - \hat{p}_{B_2}) \cdots (1 - \hat{p}_{B_k}),$$

where $\hat{p}_{B_j} = N_{B_j}/\hat{N}$ and $\hat{N}$ is the MLE of $N$ obtained from
(7). From equation (7),

$$(1 - \hat{p}_{B_1})(1 - \hat{p}_{B_2}) \cdots (1 - \hat{p}_{B_k}) = 1 - \frac{N_{b_1} + \cdots + N_{b_1 \ldots b_k}}{\hat{N}},$$

which simplifies to

$$\hat{\pi}_i = \frac{N_{b_1} + \cdots + N_{b_1 \ldots b_k}}{\hat{N}}.$$

An estimated Horvitz and Thompson (1952) estimator of
the population total is

$$\hat{Y} = \sum_{i \in B_1 \cup \cdots \cup B_k} \frac{y_i}{\hat{\pi}_i} = \frac{\hat{N}}{N_{b_1} + \cdots + N_{b_1 \ldots b_k}} \sum_{i \in B_1 \cup \cdots \cup B_k} y_i = \hat{N} \bar{y},$$

where $\bar{y}$ is the mean of distinct elements on the list
frames. Thus, for $k$ independent list frames, the estimated
Horvitz-Thompson estimator coincides with the population
total estimator proposed by Pollock, Turner and Brown
(1994).
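
A short sketch makes the equivalence concrete: weighting every distinct listed unit by the inverse of the common estimated inclusion probability gives the same value as $\hat{N}$ times the mean of the distinct units. The function name and the data values are hypothetical.

```python
def estimated_ht_total(y_distinct, n_hat):
    """Estimated Horvitz-Thompson total over the distinct listed units.

    y_distinct: y-values of the distinct units on the k list frames.
    n_hat: estimate of the population size N.
    """
    # Common estimated inclusion probability: distinct count / N_hat.
    pi_hat = len(y_distinct) / n_hat
    return sum(y / pi_hat for y in y_distinct)


y = [4.0, 7.0, 1.0, 8.0]                # hypothetical measurements
total = estimated_ht_total(y, n_hat=10.0)
mean_form = 10.0 * sum(y) / len(y)       # N_hat times the mean
```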

In some situations, values of the variable of interest, $y_i$,
are not available for all units on the list frames. If the list
frames are large in size, random samples are selected from
each list frame and data are collected on those subsampled
elements. If there are $k$ list frames, it is possible to define
$2^k - 1$ domains. We consider an extension of Lund’s (1968)
estimator for the total of all units on the list frames,

$$\hat{Y}_L = \sum_{d=1}^{2^k - 1} N_d\, \bar{y}_d,$$

which is a weighted sum of $2^k - 1$ domain means, $\bar{y}_d$. The
weights are given by the domain sizes. Further, the
population total estimator is


3. MULTIPLE LISTS PLUS AN AREA FRAME 


3.1 Population Size Estimation 


Joining multiple, individual list frames with an area
frame sample is one way to overcome list frame deficiencies.
Assume that the geographical area of interest is
subdivided into $U_A$ segments. Also, assume that a simple
random sample of $u_A$ segments is selected from the $U_A$
segments that cover the entire population. Therefore, the
probability of a segment being selected is $p_A = u_A/U_A$. In
some surveys, it is possible to subdivide the region into
approximately equally-sized segments. In such cases the
segment selection probability corresponds approximately to
the proportion of area sampled. The inclusion of an area
frame provides completeness of the target population
(Hartley 1962). We assume that each reporting unit belongs
to exactly one segment. Once a segment is selected, all
reporting units within the segment are observed. For
example, when estimating the number of bald eagle nests,
each nest belongs to one and only one segment. However,
this assumption is not always valid. Consider the case where
a hog farm crosses segment boundaries. In this case,
population elements may be associated with more than one
segment. To address this problem, association rules linking
population elements to segments are established at the
estimation stage. See Faulkenberry and Garoui (1991) for
more detail. The National Agricultural Statistics Service
implements three correspondence rules that map elements in
the population to sampled segments. The open, closed, and
weighted segment estimators are described in Nealon
(1984). Another related reference is Sirken (1970).

Consider the case of $k$ independent list frames plus an
area frame. The population size, $N$, and the list frame
inclusion probabilities, $p_{B_i}$, $i = 1, \ldots, k$, are unknown
parameters. The area frame inclusion probability, $p_A = u_A/U_A$, is
known. The likelihood function has the form

$$L(p_{B_1}, \ldots, p_{B_k}, N \mid p_A, n_a, n_{ab_1}, \ldots, n_{ab_1 \ldots b_k}, N_{b_1}, \ldots, N_{b_1 \ldots b_k}) = \binom{N}{n_a, n_{ab_1}, \ldots, n_{ab_1 \ldots b_k}, N_{b_1}, \ldots, N_{b_1 \ldots b_k}}\, p_A^{n_A} (1 - p_A)^{N - n_A} \prod_{i=1}^{k} p_{B_i}^{N_{B_i}} (1 - p_{B_i})^{N - N_{B_i}},$$

where $n_A$ is the total number of elements in the $u_A$ sampled
area segments and $n_a$ is the number of elements in the $u_A$
sampled area segments which do not belong to any list
frames. Similarly, $n_{ab_1}, \ldots, n_{ab_1 \ldots b_k}, N_{b_1}, \ldots, N_{b_1 \ldots b_k}$ are
defined as the sizes of different domains. It is important to
emphasize that the inclusion of an area frame may cause the
value of $N_{b_1}$ to change. $N_{b_1}$ now corresponds to the
number of elements on list frame $B_1$ which are not in the
$u_A$ selected area segments and not in any other list frame.
The MLEs of the parameters are given by $\hat{p}_{B_i} = N_{B_i}/\hat{N}$,
where $\hat{N}$ is a solution to the $k$-th degree polynomial

$$N (1 - p_A)(1 - \hat{p}_{B_1}) \cdots (1 - \hat{p}_{B_k}) = N - n_a - n_{ab_1} - \cdots - n_{ab_1 \ldots b_k} - N_{b_1} - \cdots - N_{b_1 \ldots b_k}. \qquad (8)$$

Numerical methods are essential for solving (8) for the
MLE $\hat{N}$ of $N$. Among the $k$ roots of (8), we select the $\hat{N}$ that
maximizes the likelihood.

Applying this methodology to one list frame and one
area frame, we obtain

$$\hat{N} = N_{B_1} + \frac{n_a}{p_A}. \qquad (9)$$


This estimator is also known as the screening estimator
(Kott and Vogel 1995). The screening estimator categorizes
elements into two distinct groups. The first group
contains elements which belong to both the list and area
frames and is called the overlap domain. Since it is
assumed that all elements on a list frame belong to the area
frame, the size of the overlap domain coincides with the
number of elements on frame $B_1$ and has the value $N_{B_1}$.
The second group contains elements in the area frame not
included on the list frame(s) and is referred to as the
nonoverlap domain. The size of the nonoverlap domain is
an unobserved random quantity, $N_a$. The term $n_a$ is the
number of elements found in the $u_A$ area segments which
are not included on the list frame(s) following a specific
association rule. An estimated value of $N_a$ is $n_a/p_A$.
Hence, an estimate of the population size is given by $\hat{N}$ in
(9). The resulting MLE of $p_{B_1}$ is


When multiple list frames are available, it is possible to
combine them into a single list frame and use the above
estimator to obtain an estimate of $N$. That is, consider the
screening estimator

$$\hat{N}_2 = N_{b_1} + \cdots + N_{b_1 \ldots b_k} + n_{ab_1} + \cdots + n_{ab_1 \ldots b_k} + \frac{n_a}{p_A}. \qquad (10)$$

Note that the screening estimator $\hat{N}_2$ is appropriate even
when the list frames are not independent of each other. We
discuss this further in Section 4.
Using this methodology for one area and two
independent list frames yields the likelihood

$$L(p_{B_1}, p_{B_2}, N \mid p_A, n_a, N_{b_1}, N_{b_2}, n_{ab_1}, n_{ab_2}, N_{b_1 b_2}, n_{ab_1 b_2}) = \binom{N}{n_a, N_{b_1}, N_{b_2}, n_{ab_1}, n_{ab_2}, N_{b_1 b_2}, n_{ab_1 b_2}}\, p_A^{n_A}\, p_{B_1}^{N_{B_1}}\, p_{B_2}^{N_{B_2}}\, (1 - p_A)^{N - n_A} (1 - p_{B_1})^{N - N_{B_1}} (1 - p_{B_2})^{N - N_{B_2}}.$$

The MLE of $N$ is

$$\hat{N}_3 = \hat{N} = (2p_A)^{-1} \left[(N_{B_1} + N_{B_2})\, p_A + (n_a - N_{b_1 b_2} - n_{ab_1 b_2})\right] + (2p_A)^{-1} \left\{\left[(N_{B_1} + N_{B_2})\, p_A + (n_a - N_{b_1 b_2} - n_{ab_1 b_2})\right]^2 + 4 p_A (1 - p_A) N_{B_1} N_{B_2}\right\}^{1/2}, \qquad (11)$$

where $n_{ab_1 b_2}$ denotes the number of elements included in
the $u_A$ sampled area segments that belong to both list
frames. An estimate of the variance of $\hat{N}_3$ may be obtained
using the Taylor series approximation of (11) and the
asymptotic distribution of $(N_{B_1}, N_{B_2}, n_a, N_{b_1 b_2}, n_{ab_1 b_2})$.
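
The closed form (11) and the screening estimator can be compared on a small numerical example. The sketch below is illustrative only: the function names are hypothetical, and the counts are the expected values for a population of 1000 with $p_A = 0.1$ and independent list frames with inclusion probability 0.7 each, so both estimators should return about 1000.

```python
import math


def full_likelihood_estimate(n_B1, n_B2, n_on_both, n_a, p_a):
    """Closed-form root (11) for two independent list frames plus an area frame.

    n_B1, n_B2: list frame sizes N_B1 and N_B2.
    n_on_both: elements on both lists, N_b1b2 + n_ab1b2.
    n_a: sampled area-frame elements found on neither list.
    p_a: known area frame inclusion probability.
    """
    b = (n_B1 + n_B2) * p_a + (n_a - n_on_both)
    disc = b * b + 4.0 * p_a * (1.0 - p_a) * n_B1 * n_B2
    return (b + math.sqrt(disc)) / (2.0 * p_a)


def screening_estimate(n_list_distinct, n_a, p_a):
    """Distinct listed elements plus the expanded nonoverlap count n_a / p_A."""
    return n_list_distinct + n_a / p_a


# Expected counts under N = 1000, p_A = 0.1, p_B1 = p_B2 = 0.7, independence:
# 700 on each list, 490 on both, 910 distinct listed, 9 sampled on neither list.
n3 = full_likelihood_estimate(700, 700, 490, 9, 0.1)
n2 = screening_estimate(910, 9, 0.1)
```

Both functions take summary counts rather than unit-level data, which keeps the comparison with (8) and (11) direct.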


3.2 Population Total Estimation 


When the $y_i$’s are available for all elements on $k$ independent
list frames and for a sample of segments from an area
frame, we consider an estimated Horvitz-Thompson estimator
to estimate the population total. Recall that we assume
the following:


1. The probability that a unit is included on the $i$-th list
frame, $p_{B_i}$, is the same for all units.


2. The event that a unit is included on one frame is
independent of its inclusion on another frame.


3. The probability that a unit is included in the area frame
sample of $u_A$ segments is $p_A = u_A/U_A$.

Since we consider the case where population units belong
to exactly one area segment and all units within a sampled
segment are observed, the third assumption is valid. Hence,
the estimated probability that the $i$-th element is on at least
one of the $k$ list frames and/or the area frame sample is

$$\hat{\pi}_i = 1 - (1 - p_A)(1 - \hat{p}_{B_1}) \cdots (1 - \hat{p}_{B_k}) = \frac{n_a + n_{ab_1} + \cdots + n_{ab_1 \ldots b_k} + N_{b_1} + \cdots + N_{b_1 \ldots b_k}}{\hat{N}}.$$

The estimated Horvitz-Thompson population total estimator
is

$$\hat{Y} = \sum_{i \in \text{sample}} \frac{y_i}{\hat{\pi}_i} = \frac{\hat{N}}{n_a + n_{ab_1} + \cdots + n_{ab_1 \ldots b_k} + N_{b_1} + \cdots + N_{b_1 \ldots b_k}} \sum_{i \in \text{sample}} y_i = \hat{N} \bar{y},$$

where $\bar{y}$ is the mean of the distinct elements on list frames
$B_1, \ldots, B_k$ and the elements in the area frame sample.

We can also use the screening estimator to estimate the
population total. The known overlap domain total is
combined with an estimator of the nonoverlap domain
(NOL) total to yield $\hat{Y}_S = Y_L + \sum_{i \in \mathrm{NOL}} y_i / p_A$. The NOL
domain consists of elements on the area frame that are not
on any of the list frames and $Y_L = \sum_{i \in B_1 \cup \cdots \cup B_k} y_i$ is the total of the
distinct units on the $k$ list frames. In the subsampling case,
we may replace $Y_L$ in $\hat{Y}_S$ by Lund’s estimator, given by

$$\hat{Y}_L = N_{b_1} \bar{y}_{b_1} + N_{b_2} \bar{y}_{b_2} + \cdots + N_{b_1 \ldots b_k} \bar{y}_{b_1 \ldots b_k}.$$


4. DEPENDENT LIST FRAMES 


We now consider the case where dependencies exist 
among list frames but where area and list frames remain 
independent. In capture-recapture experiments, for 
example, the probability an animal is captured on the 
second sampling occasion may depend on whether it was 
captured on the first sampling occasion. See Fienberg 
(1972), Cormack (1989), Wolter (1990), Pollock, Hines, 
and Nichols (1984), Huggins (1989), and Alho (1990) for 
specific examples. 

We consider the case where we have two list frames, $B_1$
and $B_2$, that are dependent. Let $p_{11}$ denote the probability
of being included on both list frames. If $B_1$ and $B_2$ are
independent, then $p_{11} = p_{B_1} p_{B_2}$, where $p_{B_1}$ and $p_{B_2}$ are
inclusion probabilities for $B_1$ and $B_2$, respectively. Define
$p_{10}$ ($p_{01}$) as the probability of being included on frame
$B_1$ ($B_2$) but not on frame $B_2$ ($B_1$). The probability of
exclusion from both list frames is denoted by
$p_{00} = 1 - p_{B_1} - p_{B_2} + p_{11}$.

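The likelihood function is given by

$$L(p_{B_1}, p_{B_2}, p_{11}, N \mid p_A, n_a, N_{b_1}, N_{b_2}, n_{ab_1}, n_{ab_2}, N_{b_1 b_2}, n_{ab_1 b_2}) = \binom{N}{n_a, N_{b_1}, N_{b_2}, n_{ab_1}, n_{ab_2}, N_{b_1 b_2}, n_{ab_1 b_2}}\, p_A^{n_A} (1 - p_A)^{N - n_A}\, (p_{B_1} - p_{11})^{N_{b_1} + n_{ab_1}}\, (p_{B_2} - p_{11})^{N_{b_2} + n_{ab_2}}\, p_{11}^{N_{b_1 b_2} + n_{ab_1 b_2}}\, (1 - p_{B_1} - p_{B_2} + p_{11})^{N - N_{b_1} - n_{ab_1} - N_{b_2} - n_{ab_2} - N_{b_1 b_2} - n_{ab_1 b_2}}. \qquad (12)$$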


Maximizing (12) with respect to $p_{B_1}$, $p_{B_2}$, $p_{11}$ and $N$ leads to
the approximate solution

$$\hat{N} = N_{b_1} + N_{b_2} + N_{b_1 b_2} + n_{ab_1} + n_{ab_2} + n_{ab_1 b_2} + \frac{n_a}{p_A},$$

which coincides with the screening estimator $\hat{N}_2$. That is, $\hat{N}$
is also the estimator that is obtained by pooling the two list
frames into a single list frame where the duplications are
eliminated and the nonoverlap domain size is estimated
using the area frame sample. Also, it can be shown that the
two-stage maximum likelihood procedure of Sanathanan
(1972) leads to:

Thus, the maximum likelihood estimator and Sanathanan’s
estimator both coincide with the screening estimator. If
information from two dependent list frames is available and
the nature of the dependency is unknown, then we cannot
estimate the individual parameters. When information from
an independent area frame is available, all parameters are
estimable. However, for estimating $N$, $N_{B_1 \cup B_2}$ is sufficient
and no additional information is gained from $N_{B_1}$, $N_{B_2}$ and
$N_{b_1 b_2}$.

Methods are available for modeling the dependence 
among $k$ list frames when estimating population size and
totals. Additional population information or information 
from an independent area frame is needed to accurately 
model the dependence. Fienberg (1972) and Cormack 
(1989) consider constrained log-linear models to model the 
dependence. On the other hand, Wolter (1990) uses 
external constraints such as a known sex ratio to estimate 
the population size in the dependence case. Another 
technique used is to model the inclusion probabilities as a 
function of the covariates. Alho, Mulry, Wurdeman and 
Kim (1993) use a conditional logistic regression model to 
estimate the probability of being enumerated in a census 
and apply the model to the 1990 Post-Enumeration Survey. 
The role of auxiliary variables in capture-recapture 
experiments with unequal capture probabilities is addressed 
in Pollock et al. (1984), Huggins (1989), and Alho (1990). 


5. SIMULATION STUDY 


We conduct a simulation study to assess the overall
efficiency of different population size estimators for the
special case of two list frames plus an area frame. This is
the most feasible combination of sampling frames for real
survey problems.


5.1 Design of the Study 


In order to study both dependent and independent cases,
we define the parameter $\theta$ that reflects the dependence
structure between list frames $B_1$ and $B_2$. It has the same
form as the odds ratio and is written formally as

$$\theta = \frac{p_{00}\, p_{11}}{p_{01}\, p_{10}}.$$

In the case of two list frames, the value of $\theta$ determines a
unique solution for $p_{11}$. Our study varies the following
factors:


Factor                    Levels                Definition
$N$                       500, 5000             Population size
$p_A$                     0.05, 0.10, 0.20      Inclusion probability for area frame $A$
$p_{B_1}$ ($= p_{B_2}$)   0.7, 0.9              Inclusion probability for list frame $B_1$ ($B_2$)
$\theta$                  0.5, 1.0, 1.5, 2.0    Odds ratio


For each parametric combination, we generate data
$(n_a, N_{b_1}, N_{b_2}, n_{ab_1}, n_{ab_2}, N_{b_1 b_2}, n_{ab_1 b_2})$. One thousand Monte Carlo
replications are generated for each parametric combination.
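
One way to generate a single replicate of this design is sketched below. It is an assumption about how the data could be simulated, not a description of the authors' program: $p_{11}$ is obtained from the quadratic implied by the odds ratio, and each of the $N$ units is then assigned independently to the area sample and to a list cell. All function and key names are hypothetical.

```python
import math
import random


def solve_p11(theta, p_b1, p_b2):
    """Solve theta = p00 * p11 / (p01 * p10) for p11 given the list margins."""
    s = p_b1 + p_b2
    a = 1.0 - theta
    b = (1.0 - s) + theta * s
    c = -theta * p_b1 * p_b2
    if abs(a) < 1e-12:              # theta = 1: independence, p11 = p_b1 * p_b2
        return -c / b
    root_disc = math.sqrt(b * b - 4.0 * a * c)
    lower, upper = max(0.0, s - 1.0), min(p_b1, p_b2)
    for root in ((-b + root_disc) / (2.0 * a), (-b - root_disc) / (2.0 * a)):
        if lower <= root <= upper:  # keep the root that is a valid probability
            return root
    raise ValueError("no admissible p11 for these inputs")


def simulate_counts(n_pop, p_a, p_b, theta, rng):
    """Draw the seven observable domain counts for one replicate."""
    p11 = solve_p11(theta, p_b, p_b)
    weights = (p11, p_b - p11, p_b - p11, 1.0 - 2.0 * p_b + p11)
    labels = ("b1b2", "b1", "b2", "")          # "" means on neither list
    counts = {"n_a": 0, "n_ab1": 0, "n_ab2": 0, "n_ab1b2": 0,
              "N_b1": 0, "N_b2": 0, "N_b1b2": 0}
    for _ in range(n_pop):
        in_area_sample = rng.random() < p_a
        lists = rng.choices(labels, weights=weights)[0]
        if in_area_sample:
            counts["n_a" + lists] += 1
        elif lists:                             # listed but not area-sampled
            counts["N_" + lists] += 1
        # units on no list and outside the area sample are never observed
    return counts


replicate = simulate_counts(500, 0.10, 0.7, 1.5, random.Random(1))
```

For $p_{B_1} = p_{B_2} = .9$ the solver gives $p_{11}$ of about .806 at $\theta = .5$ and about .817 at $\theta = 2$.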


5.2 Estimators


We compare four population size estimators, $\hat{N}_1$, $\hat{N}_2$, $\hat{N}_3$,
and $\hat{N}_4$. $\hat{N}_1$ is the Lincoln-Petersen estimator, which does
not incorporate area frame information. The estimator $\hat{N}_1$
is suitable when the list frames are independent. Since the
estimator ignores information from the area frame sample,
it is expected to be inefficient when information from an
area frame is available. The screening estimator, $\hat{N}_2$, sums
the overlap and nonoverlap domain estimates and is
particularly suitable for the dependent list frame case. The
third estimator, $\hat{N}_3$, is derived from the full, independent
sampling frame likelihood function. This estimator exploits
the information contained in the area and list frames and the
fact that the list frames are independent ($\theta = 1$).

We expect $\hat{N}_3$ to be the best estimator when list frames $B_1$
and $B_2$ are independent, whereas we expect $\hat{N}_2$ to be the
best estimator in the dependent case. We also
consider a pre-test estimator that tests for independence of
the list frames. We define $\hat{N}_4$ to be $\hat{N}_2$ if there is strong
evidence to believe that frames $B_1$ and $B_2$ are not
independent. Otherwise, we take $\hat{N}_4 = \hat{N}_3$. Formally,

$$\hat{N}_4 = \begin{cases} \hat{N}_2 & \text{if GOF} > \chi^2_{1,\,0.05} = 3.84, \\ \hat{N}_3 & \text{otherwise,} \end{cases}$$

where GOF is the chi-square goodness-of-fit test statistic
for testing $H_0$: $\theta = 1$ and is derived from the following
two-way table.


Figure 1. Classification of Sampled Area Frame Elements


Figure 1 categorizes the $n_A$ elements according to their
presence on or absence from list frames $B_1$ and $B_2$.
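
The pre-test rule can be sketched as follows. The text does not give the formula for the GOF statistic, so the sketch assumes the ordinary Pearson chi-square statistic for a 2 x 2 table without continuity correction, applied to the cross-classification of the sampled area frame elements by list membership. The function names and the example counts are hypothetical.

```python
def pearson_chi_square_2x2(n_both, n_b1_only, n_b2_only, n_neither):
    """Pearson chi-square for independence in a 2 x 2 table (no correction)."""
    total = n_both + n_b1_only + n_b2_only + n_neither
    on_b1 = n_both + n_b1_only
    off_b1 = n_b2_only + n_neither
    on_b2 = n_both + n_b2_only
    off_b2 = n_b1_only + n_neither
    if min(on_b1, off_b1, on_b2, off_b2) == 0:
        return 0.0                  # degenerate table: no evidence either way
    cross = n_both * n_neither - n_b1_only * n_b2_only
    return total * cross ** 2 / (on_b1 * off_b1 * on_b2 * off_b2)


def pretest_estimate(n_hat_screening, n_hat_full, gof, critical=3.84):
    """Use the screening estimate when independence is rejected at the 5% level."""
    return n_hat_screening if gof > critical else n_hat_full


# Hypothetical tables of sampled area-frame elements.
gof_independent = pearson_chi_square_2x2(49, 21, 21, 9)    # cross-product is 0
gof_dependent = pearson_chi_square_2x2(40, 10, 10, 40)
```

The critical value 3.84 is the one quoted in the text for one degree of freedom at the 5% level.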


5.3 Comparing the Estimators


Tables 2 and 3 display the percent relative bias and the
percent relative root mean square error of the estimates
$\hat{N}_1$, $\hat{N}_2$, $\hat{N}_3$ and $\hat{N}_4$ for population sizes of 500 and 5000,
respectively. We scale the bias and the root mean square
error by $N$ in order to directly compare estimators based on
different population sizes. A comparison of $\hat{N}_1$ with $\hat{N}_3$
shows the benefit of drawing an area frame sample. In
practice, these benefits depend on the relative cost of the
area frame sample. In this study, we do not take sampling
costs into account. The probability of being included on
both list frames, $p_{11}$, is given in parentheses in the $\theta$
column. When $p_{B_1} = p_{B_2} = .9$, $p_{11}$ must lie between .8 and .9.
However, for $\theta$ ranging from .5 to 2, $p_{11}$ varied only from
.806 to .817.
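
The two summary measures can be computed from a set of Monte Carlo estimates as in the sketch below, which assumes the usual definitions: the mean error and the root mean square error, each divided by the true $N$ and multiplied by 100. The function names and example values are hypothetical.

```python
import math


def percent_relative_bias(estimates, true_n):
    """100 * (mean of estimates - N) / N."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_n) / true_n


def percent_rrmse(estimates, true_n):
    """100 * sqrt(mean squared error) / N."""
    mse = sum((e - true_n) ** 2 for e in estimates) / len(estimates)
    return 100.0 * math.sqrt(mse) / true_n


# Two hypothetical replicates around a true size of 100.
bias = percent_relative_bias([90.0, 110.0], 100.0)
rrmse = percent_rrmse([90.0, 110.0], 100.0)
```

In this toy example the errors cancel in the bias but not in the root mean square error, which is why the tables report both.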

The estimator $\hat{N}_2$ is unbiased for $N$ and has the smallest
percent relative bias. The estimators $\hat{N}_1$ and $\hat{N}_3$ are
asymptotically consistent for $N$ and yield biases close to 0
when $\theta = 1$. On the other hand, $\hat{N}_1$ and $\hat{N}_3$ have large
biases when $\theta \neq 1$. The percent relative bias of $\hat{N}_3$ is
smaller than that of $\hat{N}_1$ but it is not close to zero. The bias
does not change significantly as $p_A$ increases from .05 to
.20.

When $N = 500$ and $p_{B_1} = p_{B_2} = .9$, $\hat{N}_3$ has the smallest
percent relative root mean square error (% RRMSE). This
is partly due to the fact that the limited range of $p_{11}$ values
is similar to the $p_{11}$ value for the independence case (.810).
The % RRMSE for $\hat{N}_3$ is 40-50% smaller than that of $\hat{N}_2$.
On the other hand, the % RRMSE of $\hat{N}_3$ is only 15-30%
smaller than that of $\hat{N}_1$. Therefore, when the list frames
have very high inclusion probabilities, both $\hat{N}_1$ and $\hat{N}_3$ are
much better than $\hat{N}_2$. Additionally, if area frame sampling
costs are high, $\hat{N}_1$ may be a reasonable alternative estimator
to $\hat{N}_3$. When $N = 500$ and $p_{B_1} = p_{B_2} = .7$, $\hat{N}_3$ has the smallest
% RRMSE for the independence case. When $\theta = .5$, $\hat{N}_2$ has
the smallest % RRMSE. If $N = 5000$ and $p_{B_1} = .7$, $\hat{N}_3$ has
the smallest % RRMSE for only $\theta = 1$. For all other $\theta$
values, $\hat{N}_2$ yields the smallest % RRMSE. In all cases, $\hat{N}_3$
has very small variance and most of the % RRMSE is due
to the bias in $\hat{N}_3$. For $\theta < 1$, $\hat{N}_3$ tends to have positive bias
while for $\theta > 1$, $\hat{N}_3$ has negative bias. For the case of $N = 5000$
and $p_{B_1} = .9$, $\hat{N}_3$ has the smallest % RRMSE for $\theta = 1$.
$\hat{N}_2$ has the smallest % RRMSE for $\theta = .5$ and 2. For
$\theta = 1.5$, there is no best estimator with respect to
% RRMSE.

As expected, the percent relative root mean square errors
of $\hat{N}_2$, $\hat{N}_3$, and $\hat{N}_4$ decrease as the value of $p_A$ increases.
Thus, as the area frame information increases, the
% RRMSE decreases. Also, as the population size
increases from 500 to 5000, the % RRMSE decreases.
Since the values of $p_A$ in our simulation are small, $\hat{N}_2$ has
a large variance. On the other hand, even though $\hat{N}_3$ is
biased, it has a very small standard error and results in a
smaller % RRMSE. The estimator $\hat{N}_4$ reduces the bias of $\hat{N}_3$
but has a large standard error. Hence, $\hat{N}_4$ is not a
particularly beneficial estimator. For larger values of $\theta$ and $p_A$,
we expect $\hat{N}_2$ to perform better than $\hat{N}_3$. For the values of
$\theta$ and $p_A$ we considered, we prefer $\hat{N}_3$ over other
estimators.

The value of % RRMSE for $\hat{N}_4$ is between that of $\hat{N}_2$
and $\hat{N}_3$ in most cases. We write the estimator $\hat{N}_4$ as
$\hat{N}_4 = \delta \hat{N}_2 + (1 - \delta) \hat{N}_3$, where $\delta = 0$ or 1 based on the results of
the chi-square goodness-of-fit test. The % RRMSE and % RBias of
$\hat{N}_4$ need not be between those of $\hat{N}_2$ and $\hat{N}_3$ because $\delta$ is
not independent of $\hat{N}_2$ and $\hat{N}_3$.


5.4 Limitations of the Study 


The goal of our study is to compare the bias, standard 
error, and mean square error of four population size 
estimators. We assume that inclusion probabilities for both 
list frames are identical. Future studies may include
unequal inclusion probabilities as well as larger values of $\theta$.
Clearly the benefit of $\hat{N}_3$ over $\hat{N}_1$ depends on the cost of
sampling from an area frame. Our paper considers only
small values of $p_A$. Small $p_A$ values are associated with a
high area frame sampling cost. Even in this case, we
observe a significant reduction in % RRMSE and % RBias,
thereby justifying the use of $\hat{N}_3$ over $\hat{N}_1$. We do not
consider an objective function which incorporates sampling
costs, % RRMSE, and % RBias.

Throughout this paper, we assume that all units have the 
same probability of being included on a given list frame. 
Haines (1997) considers the case where the inclusion 
probabilities are modeled as a function of a covariate. 
When inclusion probabilities are heterogeneous, larger units 
may have a higher list frame inclusion probability than 
smaller units. Heterogeneous inclusion probabilities play 
an important role in estimating population totals when the 
response variable has a highly skewed distribution or has 
rare values. Haines (1997) also presents two stratification 
procedures that are useful when area and list frames are 
stratified on the same variable. These results will be 
presented in future publications. 


6. DISCUSSION 


The primary focus of this paper is population size 
estimation based on several sampling frames. Information 
from area and/or list frame(s) is collected and combined to 
obtain various estimators. We derive population size 
estimators when information is available only on k 
independent list frames and also when information is 
available on an area frame sample in addition to the list 
frames. We conduct a simulation study to compare the 
performance of the estimators in the special case of two list 
frames plus an area frame. Based on our simulation study,
we recommend the estimator derived from the full,
independent likelihood, $\hat{N}_3$, for the case where the list
frames are independent or nearly independent. For the
moderate to strong dependence cases, we recommend the
screening estimator, $\hat{N}_2$.


Table 2
Simulation Results for $N = 500$

                                              $p_A = .05$         $p_A = .10$         $p_A = .20$
$p_B$  $\theta$ ($p_{11}$)   Estimator     %RBias   %RRMSE    %RBias   %RRMSE    %RBias   %RRMSE
.7     .5 (illegible)        $\hat{N}_1$    62.30    66.01     60.64    64.04     63.26    66.81
                             $\hat{N}_2$     0.30    49.07     -0.75    32.37      0.85    22.58
                             $\hat{N}_3$    55.52    58.95     48.15    51.15     40.53    43.32
                             $\hat{N}_4$    48.15    58.88     37.88    49.25     24.95    38.80
       1 (.490)              $\hat{N}_1$     0.47    19.26      1.01    19.08     -0.11    19.45
                             $\hat{N}_2$     0.45    57.34      0.34    39.61      0.88    27.25
                             $\hat{N}_3$     0.43    18.21      0.83    16.93      0.14    15.75
                             $\hat{N}_4$     2.40    20.57      1.39    22.94      0.29    17.96
       1.5 (illegible)       $\hat{N}_1$   -35.60    40.06    -36.48    40.58    -35.69    40.26
                             $\hat{N}_2$     3.11    66.43     -5.08    41.96      0.30    28.79
                             $\hat{N}_3$   -32.07    36.79    -31.01    35.28    -24.04    28.88
                             $\hat{N}_4$   -22.74    47.62    -26.21    37.57    -17.06    30.38
       2 (illegible)         $\hat{N}_1$   -60.07    62.91    -61.31    64.06    -60.41    63.28
                             $\hat{N}_2$    -6.12    66.59     -1.15    46.68      1.67    30.99
                             $\hat{N}_3$   -55.36    58.35    -51.21    54.19    -40.89    43.99
                             $\hat{N}_4$   -41.39    63.79    -34.79    55.45    -18.60    41.35
.9     .5 (.806)             $\hat{N}_1$     5.37     6.79      5.21     6.63      5.59     6.97
                             $\hat{N}_2$     0.08    14.78     -0.06    10.17     -0.06     6.55
                             $\hat{N}_3$     5.04     6.44      4.62     5.93      4.24     5.53
                             $\hat{N}_4$     5.94     9.48      5.03     7.05      4.34     5.12
       1 (.810)              $\hat{N}_1$     0.30     5.01      0.17     5.01      0.25     4.94
                             $\hat{N}_2$     0.78    20.72      0.41    14.06     -0.06     9.03
                             $\hat{N}_3$     0.33     4.83      0.20     4.68      0.17     4.24
                             $\hat{N}_4$     3.23    13.79      1.88     9.35      1.00     5.98
       1.5 (illegible)       $\hat{N}_1$    -4.29     7.07     -4.39  [illegible]  -4.55  [illegible]
                             $\hat{N}_2$    -0.65    21.52      0.35    15.88     0.002    10.27

We also study population total estimation. We consider 
two scenarios for estimating population totals. In the first 
case, we assume that observations are available on all units 
that comprise the list frames. In contrast, the second case 
assumes that information is available only on subsamples 
from each of the list frames. We consider an estimated 
Horvitz-Thompson estimator if list frames are independent 
and a screening estimator to estimate the population total if 
the list frames are dependent. 

In this paper, our focus is on population size estimation. 
In practice, one may be interested in estimating population 
totals for several characteristics based on multi-stage 
samples involving unequal inclusion probabilities. 
Relevant papers on this topic include Bankier (1986), 
Skinner (1991), and Skinner, Holmes, and Holt (1994). 


Temporary Mobility and Reporting of Usual Residence


NANCY BATES and ELEANOR R. GERBER


ABSTRACT 


Temporary mobility is hypothesized to contribute toward within-household coverage error since it may affect an individual’s 
determination of “usual residence” — a concept commonly applied when listing persons as part of a household-based survey 
or census. This paper explores a typology of temporary mobility patterns and how they relate to the identification of usual 
residence. Temporary mobility is defined by the pattern of movement away from, but usually back to a single residence over 
a two-three month reference period. The typology is constructed using two dimensions: the variety of places visited and 
the frequency of visits made. Using data from the U.S. Living Situation Survey (LSS) conducted in 1993, four types of 
temporary mobility patterns are identified. In particular, two groups exhibiting patterns of repeat visit behavior were found 
to contain more of the types of people who tend to be missed during censuses and surveys. Log-linear modeling indicates 
that temporary mobility patterns are a significant predictor of usual residence, even when controlling for the amount of time
spent away and demographic characteristics.


KEY WORDS: Temporary mobility; Usual residence; Household rosters; Coverage. 


1. INTRODUCTION 


The fundamental challenge in any census of population 
is the accurate and complete count of every person within 
that population. Consequently, the extent to which people 
are missed or undercounted during a census is arguably the 
most important measure by which it is evaluated. Most 
censuses and household-based surveys begin with a roster 
question designed to list all “usual residents” of a 
household. 

Research evaluating the quality of census data suggests 
that coverage error is a problem. In 1990, the U.S. Post 
Enumeration Survey (PES) and demographic analyses 
estimated that the net national undercount was 
approximately 2% (Hogan 1993; Robinson, Ahmed, Das 
Gupta and Woodrow 1993). Other research suggests that 
coverage error in current surveys (such as the U.S. Current 
Population Survey) is even larger than undercoverage 
occurring during decennial censuses (Shapiro, Diffendal, 
Cantor 1993; Chakrabarty 1992; Pennie 1990; Hainer, 
Hines, Martin and Shapiro 1988). Research by Fein and
West (1988) and Shapiro et al. (1993) suggests that failure
to count all persons within a housing unit is a larger 
component of total coverage error than failure to count 
persons as a result of missing a housing unit. Others report 
that within-household omissions account for about one- 
third of all census omissions (Ellis 1994; Fay 1989a). 

Coverage research also indicates that persons who are 
undercounted are not randomly distributed among the 
population. For example, blacks and Hispanics are 
undercounted at a higher rate than non-Hispanic whites 
(4.6% and 4.0%, respectively, compared to 0.7%; Hogan 
1993). Persons who reside in multi-unit structures (such as
apartments) and those who rent are also more likely to be
missed (Griffin and Moriarity 1992; Moriarity and Childers
1993; Ellis 1993).

This paper concentrates on a dimension long hypothesized
to contribute to within-household coverage error.
This dimension focuses on temporary mobility into and out 
of a residence over a period of time. Specifically, we 
examine movement in terms of the number of places a 
person may visit, the number of visits he/she makes and the 
amount of time he/she spends there. This analysis examines 
whether or not mobility may be a factor influencing 
coverage and indeed be a good indicator of household 
attachment. We hypothesize that a person’s level of 
mobility tends to influence a household respondent's 
decision when defining that person as a usual resident and, 
consequently, someone he/she would or would not include 
on a census report. 


2. BACKGROUND 


The movement from one geographical location to another 
is usually signified by a change of address, movement of 
possessions and so on. This type of mobility is commonly 
referred to as geographic mobility. In addition to 
geographic mobility, there exists a more subtle form of 
mobility that is not so clearly defined — temporary mobility. 
Defined here, temporary mobility refers to the temporary 
and sometimes patterned movement away from a residence 
and encompasses both long and short, frequent and 
infrequent overnight stays. This type of mobility has been 
described as “one of the key features of irregular and 
complex households” (de la Puente 1993). One example of 
this is found in Haitian immigrant communities where 
typical household structure consists of a relatively 
permanent “nuclear core” and a more mobile “fluid
periphery.” The fluid periphery consists of related and non- 
related newcomers, staying for short periods of time, and 
members of the household who visit Haiti on a regular basis 
and can be away weeks or months at a time (Wingerd 1992). 

Temporary mobility is not limited to special communities.
Many examples can be found in the wider community,
including mobility associated with long term business
or vacation travel, attendance at college, custody situations, 
and persons who maintain a presence in one or more 
households over a given period of time. This mobility in 
the fluid periphery, or temporary mobility, differs from 
geographic mobility because it consists of movements away 
from, but usually back to, a single residence over time. 
Members of this fluid periphery present conceptual 
difficulties for respondents in identifying which members 
to include in a census or survey. Movement of these 
persons may not involve a permanent change in address, 
and thus can blur the concept of who is defined as living or 
staying at a given address. 

Given that there is little literature on temporary mobility, 
studies on geographic mobility and household structure 
provide a good starting point for forming our hypotheses 
about temporary mobility. According to the March 1994 
Current Population Survey, young adults between 20-24 are 
reported to have the highest rates of geographic mobility, 
with one-third having moved between March 1993 and 
March 1994. Differences by race are also evident with a 
higher rate of mobility among blacks and Hispanics (19.6% 
and 22.4%, respectively) compared to whites (16.0%, see 
Hansen, 1994). Finally, tenure is also closely correlated 
with geographic mobility — renters were four times more 
likely than homeowners to have moved between 1993 and 
1994. Obviously, these geographic movers share many of 
the same characteristics as some undercounted populations. 

The kind of mobility with which we are concerned may 
also be a reflection of socioeconomic status. Temporary 
mobility, transitory situations, and peripheral connection to 
households can represent a means of adjusting for a lack of 
resources (Lipton and Estrada 1993). Hudgins and Holmes 
(1993) suggest that the undercounting of young black
males is a result of their social and economic marginality 
evidenced in part by a lack of stable residences and 
relatively permanent mailing addresses. One facet of this 
may involve temporary movement to extended families or 
“kin” networks in order to receive family or financial 
assistance. This phenomenon of extended or kin networking 
among blacks has also been documented extensively by 
ethnographic studies (Martin and Martin 1985; Stack 1974; 
Hainer et al. 1988). These living arrangements suggest
nontraditional (or at least non-nuclear) household formations
which could contribute to coverage error, especially
if a person participates in kin networks by moving back and 
forth among them. 

Finally, Montoya (1992) describes a very different
household composition that is characteristic of some recent
Hispanic immigrant communities. Like kin-network households,
they contain people who come and go; however, the
members are “loosely tied, ephemeral, and alienated” and
often composed of young migrant men who work and sleep
in different shifts and have virtually no social ties with one
another. Several other ethnographers have identified
similar households in other Hispanic communities across
the United States (Velasco 1992; Mahler 1993; Romero
1992). They found that census coverage in such households
was often restricted to those individuals who were actually
present when the enumerator arrived.


3. METHODOLOGY 


Data for this analysis come from the Living Situation 
Survey (LSS), a survey specifically designed to gather 
information about household membership, social attachments,
mobility and the assignment of usual residence. The
LSS was a voluntary survey conducted by the Research 
Triangle Institute (RTI) and sponsored by the U.S. Census 
Bureau between May and September of 1993. The sample 
was stratified to oversample for high and medium minority 
areas (i.e., greater than 80% black or Hispanic, between 
40% and 80% black or Hispanic) and areas containing 
renters (i.e., greater than 40% renters). To increase the 
efficiency of the sample design, RTI used housing unit data 
previously collected from a multistage probability sample 
used in the 1992 National Household Survey on Drug 
Abuse (NHSDA). 

The first portion of the LSS interview was conducted in- 
person with the most knowledgeable household respondent, 
in most cases, the householder (by U.S. Census Bureau 
definition, this refers to the person in whose name the house 
is owned or rented). These householders provided a roster 
and then answered demographic questions for themselves 
as well as all other listed persons. Through a series of 
13 extensive roster probes, the questionnaire rostered 
“core” household residents but also included many persons 
having a less permanent presence. Persons with a more 
tenuous attachment were brought in by asking probes about 
who had spent the night there during the reference period, 
who was considered a household member even if they were 
staying elsewhere, and who considered the residence their 
permanent address or a place they received mail or phone 
messages (see Sweet 1994). (The length of the reference 
period varied depending upon the date of the interview. 
Reference periods began on the first day of the month two 
months prior to the interview month and ended on the day 
of the interview. Accordingly, interviews conducted toward 
the end of the month had a longer reference period than 
interviews conducted near the beginning.) In total, 999 
households were interviewed nationwide. Using the broad 
rostering technique, a total of 3,549 people were listed. 
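
The reference-period rule described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and counting the length of the period as the number of days between the start date and the interview date is an assumption, since the text does not say exactly how nights were tallied.

```python
from datetime import date


def reference_period(interview_date):
    """Return (start, end) of the reference period for one interview.

    The start is the first day of the month two months before the
    interview month; the end is the interview date itself.
    """
    # Work in "months since year 0" so that January and February
    # interviews roll back into the previous calendar year correctly.
    month_index = interview_date.year * 12 + (interview_date.month - 1) - 2
    start = date(month_index // 12, month_index % 12 + 1, 1)
    return start, interview_date


def nights_in_period(interview_date):
    """Length of the reference period in nights (assumed: end minus start)."""
    start, end = reference_period(interview_date)
    return (end - start).days


# Two hypothetical interviews in the same month of the field period.
early = date(1993, 6, 2)
late = date(1993, 6, 29)
print(reference_period(early)[0], nights_in_period(early))
print(reference_period(late)[0], nights_in_period(late))
```

Both hypothetical interviews share the same start date, so the one conducted later in the month has the longer reference period, as the text notes.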

The next step in the survey was to weed out rostered 
individuals determined to be only “casual visitors” to the 
household. Individuals were defined as casual visitors if: 
1) their usual residence was considered by the householder 
to be someplace other than the sample housing unit and 2) 
they had stayed at the household for one week or less 
during the reference period. This screening process 
identified persons from the broad rostering technique who 
had only a casual attachment to the household. Of the 
3,549 persons rostered, 712 were considered to be casual 
visitors. (Of the 712 casual visitors, 77% were related to 
the household respondent, 93% were non-Hispanic, 84% 
were white and 58% were female). For several reasons, 
casual visitors were ineligible for the remainder of the 
questionnaire. First, we assumed that casual visitors do not 
meet the Census Bureau definition of a usual resident at the 
interview household and second, excluding this group from 
the bulk of the questionnaire greatly reduced the time and 
resources required to carry out the survey. 
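
The two-part screening rule for casual visitors can be written as a simple predicate. The sketch below is illustrative only: the record layout and field names are hypothetical, and treating "one week or less" as seven nights or fewer is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RosteredPerson:
    # Householder reports the person's usual residence is somewhere
    # other than the sample housing unit.
    usual_residence_elsewhere: bool
    # Nights stayed at the sample housing unit during the reference period.
    nights_stayed: int


def is_casual_visitor(person):
    """Both conditions must hold for a person to be screened out."""
    return person.usual_residence_elsewhere and person.nights_stayed <= 7


roster = [
    RosteredPerson(usual_residence_elsewhere=True, nights_stayed=3),
    RosteredPerson(usual_residence_elsewhere=True, nights_stayed=20),
    RosteredPerson(usual_residence_elsewhere=False, nights_stayed=2),
]
# Only persons who are not casual visitors continue in the questionnaire.
eligible = [p for p in roster if not is_casual_visitor(p)]
print(len(eligible))
```

Note that a short stay alone does not make someone a casual visitor; the householder must also place the person's usual residence elsewhere.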

After follow-up for converting refusals and other non- 
interviews, the final response rate for the household-level 
portion of the interview was 79.5%. (Follow-up actions 
included sending refusal conversion letters, having field 
supervisors call directly, making repeat visits, and 
reassigning interviewers. Respondents were contacted an 
average of 1.9 times; nonrespondents an average of 5.9 
times.) Given the population, this was considered to be an 
acceptable 
rate of response. Nonetheless, since we suspect that nonre- 
sponse is highly related to coverage issues such as mobility, 
it is likely that this level of nonresponse has some effect upon 
our estimates. More discussion on this is included in the 
description of the individual questionnaire below. 

The next part of the survey was a self-reported 
individual-level questionnaire. This part of the survey 
contained questions about temporary mobility as well as 
self-reported demographics. Respondents were asked if 
they had stayed overnight at any place besides the 
interview household during the reference period. If so, 
interviewers used a calendar to record each place and the 
dates stayed. Interviewers also gathered information about 
the type of each place stayed, the individual’s attachment to 
each place, and the reason(s) for going there. 

Each of the householders answered the individual-level 
questionnaire for himself/herself. Additionally, all rostered 
persons who had stayed away for eight or more nights 
during the reference period answered the individual-level 
questionnaire. All persons identified as college students 
and persons with no usual residence were also eligible for 
an individual interview. Finally, the individual question- 
naire was also given to a simple random 10% sample of 
LSS households. Within these households, individual 
interviews were attempted with each person on the roster, 
with the exception of casual visitors. This somewhat 
complex selection criterion resulted in a base of persons 
representing people with a greater-than-casual association 
to the interview households, all of whom are included in the 
analyses reported below (N = 1,451). 

The individual-level portion of the questionnaire had a 
response rate of 85.3%. The majority of individual 
interviews were conducted in person (96%) and most of the 
adult interviews (89%) were self-reported while all 
interviews with children were conducted by a knowledgeable 
proxy. Because the householders answered basic living 
situation questions and demographic questions for all 
rostered individuals, we had some means for examining the 
characteristics of the approximately 15% who were selected 
for the individual questionnaire but did not respond. We 
found no significant sex or age differences between 
nonrespondents and respondents but we found that a 
disproportionate percentage of nonrespondents were black. 
We also found that nonrespondents were more likely to 
have spent more than one week away from the interview 
household than respondents. These findings shed some 
light on how representative our individual sample is both 
demographically and with respect to temporary mobility. 
Because nonrespondents were reported to be away more 
than respondents, we suspect that this potential ‘selectivity’ 
bias leads our measures to understate temporary mobility. 

Household and individual-level weights were applied to 
adjust for the oversampling, the selection criteria for the 
individual-level survey and for nonresponse (see Lynch, 
Witt, Branson and Ardini 1993). All analyses were 
conducted using Contingency Table Analysis for Complex 
Sample Designs (CPLX), a computer variance estimation 
program designed to adjust for the LSS’s complex sample 
design effects (see Fay 1989b; 1985). 


3.1 Typology of Temporary Mobility 


The typology which we present is empirically based. 
That is, the particular groupings of visits and destinations 
were derived analytically and not theoretically. Therefore, 
the categories we identify do not represent groups of persons 
with identical characteristics or in identical circumstances. 
Rather, the typology should be regarded as an attempt to 
represent the complex underlying reality involved in mobile 
living situations. It is our hypothesis that such mobility has 
an effect on the strength of the social tie between an 
individual and a particular household, and that these ties 
influence the judgment of the household respondent in 
deciding who is a usual resident of the household. Time 
away, number of visits and number of destinations are 
indirect measures of the strength of such ties. 

Our typology of temporary mobility was created using 
two dimensions of overnight movement outside the 
interview household. The first dimension taps into the 
variety of places a person visited over the reference period. 
This provides some idea of how many places other than the 
interview household that a person might have attachments 
to. The second dimension taps the frequency of movements 
outside the interview household by counting the number of 
times a person left for a period of one or more nights. 

The use of these factors as a measure of the strength of 
attachment to a household is confirmed by ethnographic 
descriptions of highly mobile living situations. The pattern 
of movement represented in our typology reflects many 
different social processes, such as dispersed attachment to 
extended kin households (Stack 1974; Dressler, Hoeppner 
and Pitts 1985), immigration patterns (Wingerd 1992), and 
adaptation to poverty (Hainer 1987; Valentine and 
Valentine 1971). 

The LSS included several exploratory open-ended 
questions designed to examine respondents' perceptions of 
the reasons for their mobility. The questions asked the 
reasons for going and reasons for return for particular trips. 
We had hoped that these questions would provide us with 
a more direct assessment of the underlying social patterns 
that cause temporary mobility. Unfortunately, the answers 
to these open-ended questions were difficult to code without 
making unwarranted assumptions, largely as a result of the 
way in which they were expressed. As a result, we did not 
incorporate these reasons when formulating the typology. 

Each “move” was defined as a stay made outside the 
interview household for at least one night. For example, if 
a person left to spend three days at a girlfriend's, then 
moved from there to a relative’s for one night before 
returning to the interview household, that person would be 
recorded as having two total places with two total visits 
(one visit apiece). Conversely, if a person left to stay 
overnight at a friend's, returned to the household, and 
then two weeks later returned to the same friend’s home for 
a second visit, that person would be recorded as having one 
place with two total visits (two repeat visits). The first 
example illustrates a potential bias in this method, that of 
counting each unique place visited during one extended trip 
outside the interview household as an independent move 
(such as a vacation with multiple destinations). On the other 
hand, this method also captures the movement of “floaters” 
by counting each separate place visited during one move 
away from the household as a separate move. 

A single mobility measure using various combinations of 
the number of places and number of moves was 
constructed. In all, five categories were created with efforts 
made to identify different patterns of movement by 
separating out those making repeat visits to the same places. 
Our first category depicts persons who stayed all nights of 
the reference period at the interview household and 
represents persons with no temporary mobility (the “Non- 
mobile”). The second category consists of persons who, 
according to the calendar, reported only one visit to one 
place (the “1-shots”). The “Boomerangs” reflect persons 
making repeat visits to one place only. The “No-repeats” 
are characterized as persons who traveled to more than one 
place, but never the same place twice. And finally, the 
“Floaters” stayed overnight at several different places, 
making repeat visits back to at least one of these places (see 
table 1). 
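
The five categories can be expressed as a small classification rule on the two counts, number of places and number of visits. The following is a minimal sketch, not the authors' code; the function name and the input checks are assumptions added for illustration.

```python
def mobility_type(places, visits):
    """Classify a person from counts of distinct places and total visits."""
    # Every place visited implies at least one visit, and zero places
    # must go together with zero visits.
    if places < 0 or visits < places or (places == 0) != (visits == 0):
        raise ValueError("inconsistent counts of places and visits")
    if visits == 0:
        return "Non-mobile"
    if places == 1:
        # One destination: a single visit or repeat visits to it.
        return "1-Shot" if visits == 1 else "Boomerang"
    # Several destinations: no place repeated, or at least one repeated.
    return "No-repeat" if visits == places else "Floater"


# The two worked examples from the text.
print(mobility_type(places=2, visits=2))  # girlfriend's, then a relative's
print(mobility_type(places=1, visits=2))  # same friend's home twice
```

Because a repeat visit exists exactly when visits exceed places, the rule needs no information beyond the two counts.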


Table 1 
Temporary Mobility Typology 

Number of                              Number of Visits
Places Visited   0            1         2            3            4
0                Non-mobile
1                             1-Shots   Boomerangs   Boomerangs   Boomerangs
2                                       No Repeats   Floaters     Floaters
3                                                    No Repeats   Floaters
4                                                                 No Repeats


4. CHARACTERISTICS OF MOBILITY TYPES 


Table 2 presents the weighted frequencies for the 
mobility typology. Slightly more than half of the persons 
administered the individual questionnaire reported no 
mobility outside the interview household during the 
reference period. The largest concentration of persons who 
were mobile fell into the 1-shot category, that is, they 
reported making only one move outside the interview 
household to one place (26%, overall). Eleven percent 
fell into the Boomerang category, reporting a more 
repetitive pattern of two or more visits to a single place, 
while 7% fell into the less patterned, yet highly mobile, “No 
repeat” category. The Floaters comprised the smallest 
group with 4%. 


Table 2 
Typology of Temporary Mobility by Sex and Hard-To-Enumerate 
(HTE)* Status (Weighted % and standard errors) 

MOBILITY TYPE                       SEX                  HTE STATUS
(s.e. in paren.)   Total     MALE      FEMALE     NON-HTE    HTE
Non-mobile         52%       40%       67%        53%        38%
                   (14.0)    (13.7)    (13.6)     (14.3)     (7.8)
1-Shots            26%       35%       16%        27%        6%
                   (10.4)    (13.9)    (7.0)      (10.6)     (2.9)
Boomerangs         11%       15%       6%         10%        21%
                   (4.0)     (5.7)     (2.9)      (4.1)      (9.1)
No Repeats         7%        6%        8%         7%         6%
                   (2.9)     (2.4)     (4.3)      (3.0)      (5.4)
Floaters           4%        4%        3%         3%         29%
                   (1.0)     (1.3)     (1.3)      (0.9)      (9.9)
Unweighted N       1,451     653       798        1,375      76

Jackknife chi-square** 
  Sex: X² = 2.03, p < .05 
  HTE status: X² for distribution excluding non-mobile category = 2.14 


* The hard-to-enumerate group includes black and Hispanic males 
aged 18-29. 

**See Fay 1985 for documentation of Jackknife chi-square test for 
complex samples. 

Table 2 also breaks out the five mobility categories by 
gender, showing a higher propensity for temporary mobility 
among males than females. Approximately 60% of the males 
reported at least one visit outside the interview household, 
significantly higher than the approximately 33% of females. 
This gender difference in temporary mobility is much more 
pronounced than in geographic mobility where the 
difference between the male and female move rate is only 
around 1% (17% of the male population moved between 
1993 and 1994 compared to 16% for females, see Hansen 
1994). This suggests that temporary mobility is more 
common than geographic mobility and that the demo- 
graphic characteristics associated with it are different as 
well. Military and business travel could explain the gender 
difference in temporary mobility, since males have a higher 
active-duty/population ratio and a higher 
employment/population ratio than females (U.S. 
Department of Labor 1994). 

The right side of Table 2 integrates several demographic 
characteristics to create a subgroup known to have high 
rates of undercount in previous censuses. This group is 
comprised of males between 18 and 29 who are black or 
Hispanic. This subgroup is sometimes referred to as the 
“hard-to-enumerate” or HTE population. Only a small 
percentage of the LSS sample met the HTE criteria, but an 
examination of this group’s mobility reveals very different 
patterns compared to the non-HTE group. 

First, the HTE group appears more mobile to begin with — 
over 60% indicated spending at least one night someplace 
other than the interview household compared to less than 
50% for non-HTEs. Second, the distribution of mobile 
categories differs significantly by HTE status. The majority 
of non-HTEs who are mobile are concentrated in the 1-shot 
category whereas the HTEs who are mobile are more 
concentrated in the repeat movement categories (Boomer- 
angs and Floaters with 21% and 29%, respectively). 
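
The comparisons in this paragraph follow directly from the weighted percentages in Table 2, as the short sketch below shows. The helper names are hypothetical, and the non-HTE "No Repeats" cell is read as 7%, an assumption based on that column then summing to 100%.

```python
# Weighted percentages from Table 2, by HTE status.
table2 = {
    "Non-HTE": {"Non-mobile": 53, "1-Shots": 27, "Boomerangs": 10,
                "No Repeats": 7,  # assumed reading; makes the column sum to 100
                "Floaters": 3},
    "HTE": {"Non-mobile": 38, "1-Shots": 6, "Boomerangs": 21,
            "No Repeats": 6, "Floaters": 29},
}


def percent_mobile(column):
    """Share reporting at least one night away from the interview household."""
    return 100 - column["Non-mobile"]


def share_of_mobile(column, categories):
    """Percent of the mobile group falling in the named categories."""
    return 100.0 * sum(column[c] for c in categories) / percent_mobile(column)


for group, column in table2.items():
    assert sum(column.values()) == 100
    repeat = share_of_mobile(column, ["Boomerangs", "Floaters"])
    one_shot = share_of_mobile(column, ["1-Shots"])
    print(group, percent_mobile(column), round(repeat, 1), round(one_shot, 1))
```

Restricting the base to mobile persons makes the contrast sharper than the raw percentages suggest: repeat movers are a large majority of mobile HTEs but a minority of mobile non-HTEs.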

We also examined the distributions for temporary 
mobility by race (white, black, Hispanic, and other) and age 
(0-17, 18-29, 30-49, 50+). Overall, temporary mobility did 
not vary significantly by either, yet some interesting trends 
were noticeable. A relatively large concentration of 
Hispanics were found in the No-Repeat category (19%) and 
blacks in the Floater group (9%). A higher percentage of 
blacks were Non-mobile (66%) compared to whites (52%), 
in spite of the fact that blacks have higher rates of 
geographic mobility than whites. Finally, young adults 
between 18 and 29 appeared more mobile than other age 
groups (close to 70% of this age group spent at least one 
night away from the interview household) and a dispro- 
portionate percentage of this group were Floaters (14%). 
The lack of statistical significance among some of these 
trends may be an artifact of sample size. Alternatively, 
temporary mobility may be sufficiently different from 
geographic mobility that it does not share the 
characteristics of traditional ‘movers’. 

Another important variable hypothesized to correlate 
with the pattern of temporary mobility is the amount of time 
spent away on visits. The U.S. Census Bureau residence 
rules vary in the use of time as a criterion for usual 
residence. For example, persons who work in another city 
during the week but return home on weekends are to be 
counted at the place where they “live and sleep” the 
majority of the time — in this case, at the place they live 
during the week. However, a child living away at boarding 
school is to be counted at the parent's residence even though 
he/she probably spends the majority of time at the school. 
Likewise, a person staying at a group quarters on Census 
Day (e.g., a college dorm or a jail) is counted at that place, 
regardless of their living situation the rest of the year. 
Gerber (1994) found that respondents also use time to 
varying degrees when defining household rosters — in 
certain situations, she found no clear relationship between 
being rostered and the amount of time spent at a place. 
Instead, things like household membership and relationship 
seemed to factor more heavily in the decision-making 
process. 

Nonetheless, it makes intuitive sense that the amount of 
time spent away plays some part in the householder’s 
determination of where to count someone. In order to see 
how our mobility categories varied in terms of length of time 
spent away, the sum of the total number of nights spent 
away during all visits in the reference period was divided by 
the total number of nights in the reference period and then 
expressed as a percentage. Table 3 presents this time 
measure expressed in terms of being away more or less than 
half of the reference period. 
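
The time measure just described amounts to a simple ratio followed by a cut at one half. A minimal sketch follows, with hypothetical function names and made-up example values; placing exactly 50% in the "half or more" group follows the wording used elsewhere in the text.

```python
def percent_time_away(nights_per_visit, nights_in_reference_period):
    """Total nights away over all visits, as a percent of the period."""
    total_away = sum(nights_per_visit)
    if nights_in_reference_period <= 0 or total_away > nights_in_reference_period:
        raise ValueError("nights away must fit within the reference period")
    return 100.0 * total_away / nights_in_reference_period


def away_half_or_more(percent_away):
    """Dichotomy used for the time measure: half or more versus less."""
    return percent_away >= 50.0


# Hypothetical person with two visits (20 and 15 nights) in a 62-night period.
pct = percent_time_away([20, 15], 62)
print(round(pct, 1), away_half_or_more(pct))
```

Because the denominator is each person's own reference period, the measure is comparable across interviews with reference periods of different lengths.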


Table 3 
Time Spent Away from the Interview Household during the 
Reference Period (Weighted % and standard errors) 

Share of Reference
Period Spent Away   1-Shots   Boomerangs   No Repeats   Floaters   Total
Less than 50%       94%       73%          98%          63%        88%
                    (4.4)     (11.5)       (1.4)        (10.3)     (3.6)
50% or more         6%        27%          2%           37%        12%
                    (4.4)     (11.5)       (1.4)        (10.3)     (3.6)
Unweighted N        314       186          101          134        735

Jackknife chi-square = 1.71, p < .05, d.f. = 3 


Both the Boomerangs and Floaters were more likely than 
other groups to spend half or more of the reference period 
someplace other than the interview household. This 
supports the notion that the repeat visit patterns underlying 
these two groups are associated with an increase in total 
time spent away. It also suggests a higher degree of resi- 
dential ambiguity especially for the Floaters. Since 
members of this group report visits to at least two places in 
addition to the interview household, it is unclear whether 
those away more than half the time are spending a majority 
of time at any one place. If time spent at each place is 
roughly equal, it is easy to imagine Floaters not being 
rostered at any of them or at more than one of them. 
Conversely, by definition we can assume the Boomerangs 
who were away more than half the reference period spent 
the majority of their time at the only other place they 
reported visiting. Assuming time plays a role in defining a 
sense of household membership, then presumably, the 
Boomerangs have a better chance of being counted because 
the majority of their time is being spent at the other place. 


5. USUAL RESIDENCE AND MOBILITY 


We next explored whether temporary mobility has an 
impact on the household respondent’s determination of a 
person as a “usual resident”. On the 1990 U.S. census 
form, respondents were instructed to list persons at the 
place where the person lives or sleeps most of the time. 
The LSS asked household respondents whether they 
considered the interview household to be the “usual 
residence, that is the place where [you/NAME] live(s) and 
sleep(s) most of the time”. They were also asked to report 
whether “[you/NAME] have a usual residence somewhere 
else?” While this method is not a perfect replication of a 
census roster it provides an approximation of who, out of all 
those rostered during the LSS, the householder might 
naturally have included or excluded on a census form or 
current survey. 

Table 4 presents a cross-classification of usual residence 
assignment by mobility status. A combination of the usual 
residence questions resulted in four classification 
possibilities: usual residence at the interview household 
only, usual residence at someplace other than the interview 
household only, usual residence at both the interview 
household and another place, and usual residence at no 
place. (The category of “no place” was extremely small 
(less than 1%) and was combined into the category of 
“other place”). Assuming that answers of “other place” 
equate to being left off the census form, we see that overall, 
only around 4% of persons with a greater-than-casual 
association to the interview households might have been 
left off. Overall, the distribution of usual resident classifi- 
cations significantly differed according to mobility type. 
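
The way the two usual residence questions combine into the reported categories can be sketched as follows. The function and label names are hypothetical; the folding of "no place" into "other place" follows the description above.

```python
def usual_residence_class(resident_here, resident_elsewhere):
    """Combine the two householder answers into one reporting category.

    resident_here: interview household named as the usual residence.
    resident_elsewhere: a usual residence somewhere else was reported.
    """
    if resident_here and resident_elsewhere:
        return "Both places"
    if resident_here:
        return "Interview household only"
    # "Other place only" and the very small "no place" group (less
    # than 1%) are reported together.
    return "Some other place"


def possibly_left_off_census_form(category):
    """Working assumption in the text: 'other place' means left off the form."""
    return category == "Some other place"


for here in (True, False):
    for elsewhere in (True, False):
        print(here, elsewhere, usual_residence_class(here, elsewhere))
```

Two yes/no answers give four combinations, which collapse to the three reporting categories once "no place" is merged.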

As might be expected, nearly all of the persons who 
spent every night at the interview household during the 
reference period were considered usual residents there 
(rounded to 100%). The most obvious deviation among 
categories is noticeable for the Boomerangs and Floaters. 
Between 20-25% of the people in these two groups were 
characterized by household respondents as usual residents 
someplace other than the interview household. This looks 
very different from both the 1-shot and No-repeat groups, 
where only 2% and 5%, respectively, were considered usual 
residents someplace else. These results suggest that the 
latter two groups typify mobility for pleasure or business 
among persons with a firm tie to the household, 
while the Boomerangs and the Floaters are more likely to 
include persons with a less-established association to the 
household. For this reason, and because a sizable 
percentage of the HTE population was found in these two 
categories, the Boomerangs and Floaters arguably have the 
more interesting coverage implications and raise several 
questions. For example, do these persons get counted at 
one place, all places or no place? Additionally, where 
should they be counted? 


Table 4 
Where Does Household Respondent Consider Person to be a 
“Usual Resident”? (Weighted % and standard errors) 

Where Usual         Non-                                No
Resident?           Mobile   1-Shots   Boomerangs   Repeat   Floaters   Total
Interview HH Only   100%     97%       71%          95%      70%        95%
                    (0.2)    (?)       (?)          (4.2)    (?)        (?)
Some Other Place    0%       2%        25%          5%       20%        4%
                    (?)      (?)       (?)          (?)      (?)        (?)
Both Places         0%       1%        4%           0%       10%        1%
                    (?)      (?)       (2.1)        (-)      (7.3)      (?)
Unweighted N        716      314       186          101      134        1,451

Jackknife chi-square = 2.79, p < .05, d.f. = 8 


That a relatively large percentage of the Boomerangs and 
Floaters are considered residents some place other than the 
interview household suggests the potential for 
undercounting. On the other hand, the fact that 10% of the 
Floaters are defined as usual residents at both the interview 
household and another place suggests potential for 
overcoverage. The weighted number of Boomerangs and 
Floaters in these uncertain residency situations (usual 
residents elsewhere or at both places) represents 
approximately 4% of the total population. 
From this more global perspective, it seems that a non- 
trivial segment of the population is at risk of some type of 
coverage error. 


6. MODELING OF USUAL RESIDENCE 
AND MOBILITY 


Our final section statistically models the household 
respondent’s determination of usual residence. This analysis 
goes beyond the descriptive findings of the typology to 
explore whether mobility impacts the householder’s 
conceptualization of residence. The assignment of usual 
residence by the householder served as the dependent 
variable in a series of models. The dependent variable 
consisted of two categories: 1) usual resident at the inter- 
view household and 2) not a usual resident at the interview 
household. Persons considered to have a usual residence at 
both the interview household and another place were put 
into the first category. Predictor variables included age, 
sex, race, time away, and the mobility typology. The final 
models reported in Table 5, all of which include terms for 
the interaction of the independent variables, are equivalent 
to logit models for usual residence. 

The first model tested mobility as a dichotomous 
measure: those with no mobility (the Non-mobile) and those 
having spent at least one night away from the interview 
household (the 1-shot, No-Repeat, Boomerang and Floater 
categories combined). This model established first whether 
temporary mobility was a significant predictor of residency 
status regardless of the mobility pattern exhibited. This 
“first cut” was necessary first because approximately 50% 
of the sample fell into the Non-mobile category and second 
because the Non-mobile group was extremely skewed 
toward the usual resident category of the dependent 
variable. Consequently, models that attempted to include 
all five categories of the mobility typology were 
misspecified due to a large number of zero fitted cells. 

Results from the model with the dichotomous mobility 
measure and sex yielded a relatively good “fit” of the data 
(Jackknife X² for overall goodness of fit = .28, d.f. = 2, 
p = .27). Neither race nor age improved the fit. Parameter 
estimates indicated that persons in the Non-mobile category 
were more likely to be classified as usual residents than 
those having some mobility (not shown). 

Having established that mobility was significantly 
related to residency status, we next explored whether the 
pattern of temporary mobility was a predictor. First, we 
tested an independence baseline model to predict usual 
residence (U). The predictors consisted of a mobility 
variable (M), sex (S), and the amount of time spent away 
(T). The mobility variable was comprised of the four 
mobile categories (1-Shots, No-Repeats, Boomerangs, and 
Floaters). Amount of time spent away was split into two 
categories: less than half the reference period and half or 
more of the reference period. Race and age were excluded 
since neither improved the fit of the data. 


Table 5 
Goodness-of-Fit Tests and Parameter Estimates for Log-Linear Models of the Effect of Sex (S), Temporary Mobility (M), and 
Length of Time Away (T) on Determination of Usual Residence Status (U) 

A. Goodness-of-Fit Tests, (U) Usual Residence Status 

Model                     d.f.   Chi-square*   p
1. U, SMT                 15     4.79          .00
2. US, UM, UT, SMT        10     1.06          .12
3. UTM, USM, SMT          4      0.78          .16

B. Parameter Estimates, Model 3 

                                              beta     s.e.   std. value
(M) MOBILITY: 
    1-Shots                                   1.08     .40    ?**
    Boomerangs                                -1.54    .39    -3.94**
    No-Repeats                                .83      .58    1.43
    Floaters                                  -.38     .47    -.80
(S) SEX: 
    (Males)                                   .39      .27    1.44
(T) TIME AWAY: 
    (≥ ½ ref. period)                         -1.78    .27    -6.52**
(U)*(S)*(M) INTERACTION (Males) 
    1-Shots                                   -.64     .43    -1.48
    Boomerangs                                .69      .58    1.18
    No-Repeats                                .85      .62    ?
    Floaters                                  -.90     .42    -2.14**
(U)*(M)*(T) INTERACTION (≥ ½ ref. period) 
    1-Shots                                   ?        .48    ?
    Boomerangs                                -1.20    .54    -2.26**
    No-Repeats                                1.57     .74    ?**
    Floaters                                  .36      .41    0.88

* Jackknife Pearson chi-square for overall fit. 
** Significant at the .05 level. 

The baseline model (U, SMT) did not fit the data well so 
we rejected the null hypothesis that assignment of usual 
residence is independent of mobility pattern, sex, and 
amount of time spent away (Jackknife X² overall goodness 
of fit = 4.79, d.f. = 15, p = .00; see Table 5). We then fitted 
a main effects model (2) which includes the additive effects 
of S, M and T upon U (US, UM, UT, SMT). This model 
yielded a good fit (Jackknife X² overall goodness of fit = 
1.06, d.f. = 10, p = .12). Lastly, a model (3) including two 
interaction terms was also fitted (UTM, USM, SMT). This 
model assumes interactive effects of T*M and of S*M on 
U. A comparison between the main effects and interaction 
models suggested that several interactions were significant 
and should be retained (comparison Jackknife X² = 1.99, 
d.f. = 6, p = .02). Table 5 contains the overall goodness of 
fit tests along with the parameter estimates from the best 
fitting interaction model (UTM, USM, SMT; Jackknife 
X² overall goodness of fit = 0.78, d.f. = 4, p = .16). 

The parameter estimates from Table 5 illustrate that 
temporary mobility has a significant main effect on 
assignment of usual residence in model 3 which controls for 
sex, amount of time spent away, and several interactions. 
Two of the mobility categories had significant beta 
coefficients, although in opposite directions. The 1-Shots 
were significantly more likely to be defined as usual resi- 
dents (b = +1.08). Conversely, the Boomerangs had a 
negative parameter estimate (b = -1.54) meaning that the 
odds of being defined a usual resident were significantly 
decreased for this group. 
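
The standardized values reported with the parameter estimates appear to be the estimate divided by its standard error. The sketch below checks this for several Table 5 rows under two stated assumptions: that decimal points missing from the printed standard errors belong where shown, and that an absolute standardized value of about 1.96 or more corresponds to significance at the .05 level. The test actually used in the jackknife software may differ.

```python
# (label, beta, s.e.) as read from Table 5, model 3.
rows = [
    ("1-Shots (main effect)", 1.08, 0.40),
    ("Boomerangs (main effect)", -1.54, 0.39),
    ("No-Repeats (main effect)", 0.83, 0.58),
    ("Male x Floaters", -0.90, 0.42),
    ("No-Repeats x time away", 1.57, 0.74),
]


def standardized(beta, se):
    """Estimate expressed in standard-error units."""
    return beta / se


def significant(beta, se, critical=1.96):
    """Two-sided check against an assumed normal critical value."""
    return abs(standardized(beta, se)) >= critical


for label, beta, se in rows:
    z = standardized(beta, se)
    flag = "significant" if significant(beta, se) else "not significant"
    print(f"{label}: {z:.2f} ({flag})")
```

Small differences from the printed values are expected because the estimates and standard errors are themselves rounded to two decimals.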

Time spent away from the interview household had by 
far the largest effect on predicting usual residence with a 
strong negative association (b = -1.78). This means that 
for our temporarily mobile population, those away half or 
more of the reference period were significantly less likely 
to be considered usual residents than those away less than 
half of the time. Sex did not have a significant main effect, 
but was involved in a significant interaction. The inter- 
action appears in the Floater group where male Floaters 
were less likely to be categorized as usual residents than 
female Floaters (b = -.90). Further investigation revealed 
few clues to explain this finding. Male and female Floaters 
differed little in the types of places they visited, their 
reasons for visiting, and the relation to the householder of 
places they visited (relative versus non-relative). Perhaps 
the interaction reflects differences in other social attach- 
ments such as presence of children, personal belongings, 
and/or contribution of resources. 

The bottom of Table 5 indicates that the interaction 
between usual residence, mobility and amount of time spent 
away is rather complex. The amount of time spent away 
appears to affect usual residence status for some types of 
mobility but not for others. The interaction coefficient is 
significant and negative for the Boomerangs (b = -1.20). 
Thus, the odds of being defined a usual resident are even 
lower for Boomerangs away half or more of the reference 
period compared to other groups away for a similar amount 
of time. This suggests that persons who “boomerang” back 
and forth between two households will be considered usual 
residents at the place they spend the majority of time. 
However, for the No-repeats, the coefficient is significant 
and positive, essentially canceling out time away’s negative 
main effect (1.57 + (-1.78) = -0.21). For this group, the 
amount of time spent away appears to have no association 
with usual residence assignment. Apparently, factors other 
than time may be more important in the cognitive process of 
determining where these persons “reside.” One hypothesis 
is that No-repeaters are persons who must travel for a living 
and who, despite their frequent mobility and long periods 
away, clearly “belong” to a stable residence. This notion 
is consistent with findings from a vignette study in which 
respondents did not require a stated rule to be able to 
correctly identify the usual residence of persons described as 
being away on business travel. Such persons were 
“intuitively” perceived to be part of the households from 
which they were away (Gerber, Wellens and Keeley 1996). 
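
The arithmetic used above, adding the time-away main effect to a group's interaction coefficient, can be repeated for each mobility type whose coefficient is quoted in the text or legible in Table 5. This sketch simply mirrors that addition; the names are hypothetical, and reading the sum as a net effect on the log scale depends on the model's parameterization, which is not described here.

```python
TIME_AWAY_MAIN_EFFECT = -1.78

# (U)*(M)*(T) interaction coefficients from Table 5, model 3.
# The 1-Shots coefficient is omitted because it is not legible.
TIME_BY_MOBILITY = {
    "Boomerangs": -1.20,
    "No-Repeats": 1.57,
    "Floaters": 0.36,
}


def net_time_effect(mobility_type):
    """Main effect of time away plus the type-specific interaction."""
    return TIME_AWAY_MAIN_EFFECT + TIME_BY_MOBILITY[mobility_type]


for mobility_type in TIME_BY_MOBILITY:
    print(mobility_type, round(net_time_effect(mobility_type), 2))
```

The sum is close to zero only for the No-Repeats, which matches the observation that time away has little association with usual residence assignment for that group.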


7. CONCLUSIONS 


Temporary mobility, as defined in our research, involves 
long and short, frequent and infrequent, patterned and 
unpatterned movement away from, but often back to, a 
single residence. Such mobility has long been hypothesized 
to contribute toward census and survey coverage error by 
blurring the concept of who exactly lives or stays at a 
particular household. 

Our sample of persons having a more-than-casual 
association with households indicated a fair amount of temporary 
mobility over a two- to three-month period. Interesting 
demographic differences were noted in the level of mobility 
as well as the pattern of mobility reported. The “hard to 
enumerate” (HTE) group (black/Hispanic males between 18 
and 29) were found to cluster in the Boomerang and Floater 
groups, suggesting a repeat pattern of temporary mobility. 
We suspect these groups include persons having strong 
attachments to multiple households, for example an adult 
son who splits time between a parent’s and a girlfriend’s household, or a 
young mother who stays periodically at different kin- 
network households to receive assistance with child care. 

Besides the inclusion of the types of persons who tend to 
be missed in censuses and surveys, other considerations 
point to the Boomerangs and Floaters as being of particular 
interest. First, compared to the other mobility categories, 
these groups spent a longer time away from the households 
in which they were “found” and second, were more often 
classified as having a usual residence someplace other than 
the household in which they were found. It is difficult to 
estimate how much this type of mobility contributes toward 
undercounting. However, it is very noteworthy that half the 
HTE population fall in either the Boomerang or Floater 
group. It seems more than a coincidence that such a large 
segment of this population belong to one of the two mobility 
groups most easily labeled “residentially ambiguous.” 

The log-linear analysis suggests that there is not a 
clear-cut, simple relationship between temporary mobility 
and assignment of usual residence. We do not find that the 
greater the amount of temporary mobility the less the 
chance of being defined a usual resident. Instead, the 
relationship seems more driven by the pattern of movement. 
For example, the traveling salesman or truck driver who 
reports the greatest variety of places visited and the largest 
number of visits may, nonetheless, have less residential 
ambiguity than a person visiting only one other place but 
making many repeat visits. And, in fact, this proved to be 
the case for the No-repeats for whom the amount of time 
spent away had essentially no relation to usual residence 
assignment. 

Our exploration of temporary mobility represents a new 
research direction for the study of within-household census 
and survey coverage error. Two recommendations for 
improving census and survey coverage are offered. First, 
survey organizations should explore the possibility of 
directly measuring the association between temporary 
mobility and incidents of census and survey undercoverage. 
This could be accomplished by adding questions about 
mobility to post-census coverage interviews used to esti- 
mate the number of people missed or counted in error. If 
the correlation between coverage error and mobility is 
significant, then survey methods and procedures could be 
adjusted to try to reduce it. For example, new roster 
probes could be added to census forms and nonresponse 
follow-up interviews, the aim being to find more of the 
Boomerangs and Floaters. Measures of temporary mobility 
might also prove to be a powerful predictor variable when 
statistically modeling the undercount. While admittedly in 
the early stages, temporary mobility looks promising as an 
avenue to better understanding household coverage error. 


Survey Methodology, December 1998 
Vol. 24, No. 2, pp. 99-100 
Statistics Canada 


In This Issue 


This issue of Survey Methodology begins with a special section entitled “Longitudinal Surveys 
and Analysis” which contains six of the papers presented at the IASS/IAOS Satellite Meeting on 
Longitudinal Studies held in Jerusalem in 1997. One or two other papers from that conference, 
which were not ready on time for this issue, may appear in future issues of the journal. I am very 
grateful to Gad Nathan and Christopher Skinner who were the Coordinating Editors for this special 
section. Without their persistence and hard work it would not have been possible. 

The first paper in the special section, by Binder, introduces the topic by reviewing the current 
status and challenges for longitudinal studies as compared to cross-sectional studies. The discussion 
is divided into four parts, reviewing in turn the special issues and challenges encountered in the 
design, implementation, evaluation, and analysis of longitudinal surveys. 

Bassi, Torelli and Trivellato consider the problem of estimation of gross flows among labour 
force states when there are classification errors in the data. They first review various strategies for 
the collection of longitudinal labour force data, and their likely implications for classification errors. 
They then present a general modeling framework and a modified LISREL model for adjusting gross 
flows estimates to correct for classification errors. The methods are illustrated by two case studies 
using data from the U.S. Survey of Income and Program Participation and the French Labour Force 
Survey. 

Clarke and Chambers consider the impact of household level non-response on estimates of labour 
force gross flows. They propose a class of models for nonignorable household-level nonresponse. 
They then use simulations to demonstrate that labour force gross flows estimates can be biased in 
the presence of this nonignorable household level nonresponse, and that estimates using household 
level nonresponse models can reduce this bias. If the household level nonresponse mechanism is 
correctly specified then this source of bias is removed completely; however, even incorrectly 
specified household nonresponse models can reduce the bias. 

Salamin considers the problem of estimating a change in proportion for a small area. He shows 
how a general multivariate logistic regression model can be used to describe the longitudinal data 
obtained from a rotating panel design. He also considers how the parameters of this model may be 
restricted to describe various types of dependence among the repeated observations, leading to 
alternative model based estimates of change. The method is illustrated by estimating changes in 
probability of being employed for a Canton in Switzerland using data from the Swiss Labour Force 
Survey. Compared to simple differences of estimated proportions of employed persons, the model 
based estimates have smaller standard errors. 

Dorfman, in his paper, attempts to treat consumer price indices from a statistical point of view. 
He first reviews price index theory in general, including the stochastic approach and objections to 
it. He then proposes a modification to the stochastic approach, based on state space modeling, which 
circumvents the major criticism of it. The approach is illustrated using price and quantity data for 
canned tuna. 

In the last paper in the special section, Tambay, Schiopu-Kratina, Mayda, Stukel and Nadon 
describe the treatment of nonresponse in the Canadian National Population Health Survey. Data 
collected at the first cycle of the survey are considered as potential predictors of nonresponse to the 
second cycle. A CHAID (Chi-square Automatic Interaction Detection) algorithm is used to 
determine weighting classes for nonresponse adjustment at the second cycle. The paper also briefly 
describes the sample design and other steps in the derivation of the estimation weights. 

Sinclair and Gastwirth study the problem of misclassification error of labour force status in the 
Current Population Survey of the U.S. Bureau of the Census. To do so, they extend the method of 
Hui and Walter, which is appropriate for dichotomous data using reinterview data, to the 
trichotomous case. Unlike other methods, this method does not assume that reinterview data is error 
free, but rather assumes an error in both the original interview and the reinterview data. They make 
an empirical assessment by comparing the estimated error rates generated by their method as 
opposed to other existing methods such as that of Poterba and Summers, and find that the degree 
of underestimation of the error tends to be higher when the true unemployment rate is in fact high. 
Finally, rather than assuming a constant error rate throughout, they attempt an analysis assuming that 
the error rates are constant only within time groupings having differing levels of unemployment. 


Renssen considers the problem of combining information on variables collected from two 
different large surveys, using auxiliary information from a smaller third survey collecting all of the 
variables. Using ideas from statistical matching and from calibration, he proposes methods for the 
production of two-way tables, for the production of microdata files, and for the estimation of 
correlations. For the production of two-way tables his development leads to consideration of two 
different sets of calibration constraints, one termed incomplete two-way stratification and the second 
termed synthetic two-way stratification. In a simulation study using data from a pilot study for the 
Dutch Household Survey on Living Conditions, the calibration based on synthetic two-way 
stratification is shown to be much better. 

Arnab considers different strategies for sampling on two occasions. The sample at the second 
occasion is assumed to be a combination of a subsample of the first sample and a new, unmatched 
sample. Different strategies for subsampling the first sample and estimating a total at the second 
occasion are compared. He reviews strategies already existing in the literature, and proposes two 
new ones. Efficiencies of various strategies are compared analytically and empirically. 

Finally, Korn and Graubard consider the problem of generating confidence intervals for 
proportions having a small expected number of positive counts. Noting that the Clopper-Pearson 
binomial intervals traditionally used in the non-survey setting are inappropriate for use with complex 
survey data, they propose a modification of these intervals. Via simulation, they then compare the 
proposed intervals to others commonly used such as: logit-transform intervals, Breeze (1990) 
intervals based on a Poisson approximation, and normality-based linear intervals. They also illustrate 
the proposed and three alternative methods with applications using data from both the National 
Health and Nutrition Examination Survey and the Hispanic Health and Nutrition Examination 
Survey. 
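
For readers unfamiliar with the starting point of that paper, the classical Clopper-Pearson interval can be sketched as follows. This is the standard interval for a simple binomial count, not the modification proposed by Korn and Graubard, whose details are not given in this summary. The function names are illustrative, and the limits are found by plain bisection rather than by any particular library routine.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _solve_increasing(g, target, tol=1e-10):
    """Bisection on [0, 1] for g(p) = target, with g increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided interval for a binomial proportion (simple random sampling)."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


print(clopper_pearson(2, 100))  # interval for 2 positive counts out of 100
```

The interval treats the observations as independent draws, which is the assumption that breaks down under a complex sample design.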


The Editor 


Survey Methodology, December 1998 
Vol. 24, No. 2, pp. 101-108 
Statistics Canada 


Longitudinal Surveys: Why Are These Surveys Different 
From All Other Surveys? 


DAVID A. BINDER 


ABSTRACT 


We review the current status of various aspects of the design and analysis of studies where the same units are investigated 
at several points in time. These studies include longitudinal surveys, and longitudinal analyses of retrospective studies and 
of administrative or census data. The major focus is the special problems posed by the longitudinal nature of the study. 
We discuss four of the major components of longitudinal studies in general; namely, Design, Implementation, Evaluation 
and Analysis. Each of these components requires special considerations when planning a longitudinal study. Some issues 
relating to the longitudinal nature of the studies are: concepts and definitions, frames, sampling, data collection, nonresponse 
treatment, imputation, estimation, data validation, data analysis and dissemination. Assuming familiarity with the basic 
requirements for conducting a cross-sectional survey, we highlight the issues and problems that become apparent for many 
longitudinal studies. 


KEY WORDS: Frames; Administrative data; Data collection; Nonresponse; Imputation; Estimation; Data analysis. 


1. REASONS FOR LONGITUDINAL STUDIES 


Each year around the world various statistical agencies 
conduct thousands of surveys. Usually, these surveys 
obtain information required for decision or policy making. 
These surveys are not conducted just for historical 
purposes, but also to have information on what measures 
may be taken to assist with making various policy changes. 
Most surveys are based on cross-sectional data, where a 
survey is taken of a particular population at a given point in 
time. Various summaries are taken about the population 
under consideration at the time of the survey. However, 
very often the interest is not so much in what actually 
happened when the survey was taken, but what would be 
the impact of making various changes. Alternatively, a 
planned change in policy may be forthcoming and 
monitoring the effect of this change is desirable. What is 
most important is the time element. For example, when 
trying to learn about certain phenomena such as health 
status or education attainment, one is interested in the 
various determinants related to these outcomes. Some- 
times, the actual temporal relationship is not even clear in 
terms of what are the causes that precede the effects. These 
could be measured if, instead of taking a cross-sectional 
survey, surveys are conducted over time, either as a series 
of cross-sectional surveys or, alternatively, using the same 
panel of respondents from one occasion to another. This 
common sense notion has led to the desire to conduct more 
longitudinal studies. This also has the benefit that the 
effects of unobserved variables may be less important when 
the same respondents are used to compare differences over 
time. 

One of the factors contributing to the increase in the 
number of longitudinal studies is that administrative data 
sources can now be used more effectively, thus making 
certain longitudinal studies feasible. Administrative data 
are becoming increasingly available. These data are often 
routinely collected for the same individuals over a period of 
time. Even if the data collected from the administrative 
sources are not ideal for the survey-taker, they may provide 
a good proxy for the information. 

The advantage of designing a study as longitudinal is 
that a common methodology can be used for each of the 
various waves of the survey. This may lead to more valid 
conclusions. Often, when trying to understand various 
patterns of social and economic change, conducting surveys 
of the same respondents on a number of occasions is best. 
Less desirable, but possibly satisfactory, is simply to repeat 
the survey from one occasion to another without necessarily 
returning to the same respondents. This may be less costly. 
The main point is that to understand certain phenomena 
over time, collecting the information on more than one 
occasion is necessary. 

When making decisions on the nature of a new 
longitudinal study, a number of cost considerations need to 
be accounted for. Obviously, one needs to consider the 
benefits against these various costs. Issues that longitudinal 
studies could address cover many subject-matter areas. We 
enumerate just a few of them. In the area of health status, 
one is interested in changes to health status and the 
determinants that lead to these changes. In other words, 
what are the health risks, and what, in fact, is the effect of 
these health risks on health status in the long term? By 
collecting the data from the same individuals over a period 
of time, one can assess these factors, not just on small scale 
studies typical of clinical trials, but on large-scale 
nationally-based population health surveys. However, the 
type of information that can be obtained from a nationally-based 
longitudinal survey would be very different from that 
which is obtainable in a clinical trial. 

Another topic where there is interest in observations over 
time is in the area of labour and income. For example, it is 
not enough to have information on the net change to labour 
force status and labour force participation rate over time. It 
is also of interest to know which individuals move from, say, 
being unemployed to working or to not being in the labour 
force. In recent times, employment patterns have changed. 
More women are working and part-time work is more 
common. Frequency of job changes is also changing. To 
understand these phenomena, longitudinal surveys can 
answer many important questions. The characteristics, for 
example, of entry level jobs taken by those who were 
previously unemployed may be of interest, as well as 
effectiveness of different job search strategies by individ- 
uals or the effectiveness of various government training 
schemes. 
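
The distinction between net change and individual movements can be illustrated with a small sketch. It tabulates gross flows between two waves for a toy panel and shows that the net change in a state can be zero even though individuals have moved. The data, the state labels and the function names are hypothetical, and survey weights are ignored for simplicity.

```python
from collections import Counter


def gross_flows(wave1, wave2):
    """Count persons moving from each state at wave 1 to each state at wave 2.

    wave1 and wave2 map a person identifier to a labour force state.
    Only persons observed at both waves contribute.
    """
    flows = Counter()
    for person, origin in wave1.items():
        if person in wave2:
            flows[(origin, wave2[person])] += 1
    return flows


def net_change(flows, state):
    """Net change in the number of persons in `state` between the two waves."""
    inflow = sum(n for (_, dest), n in flows.items() if dest == state)
    outflow = sum(n for (origin, _), n in flows.items() if origin == state)
    return inflow - outflow


# Hypothetical panel of five persons observed on two occasions.
wave1 = {1: "employed", 2: "employed", 3: "unemployed",
         4: "unemployed", 5: "not in labour force"}
wave2 = {1: "employed", 2: "unemployed", 3: "employed",
         4: "not in labour force", 5: "not in labour force"}

flows = gross_flows(wave1, wave2)
for (origin, dest), count in sorted(flows.items()):
    print(f"{origin} -> {dest}: {count}")
print("net change in employed:", net_change(flows, "employed"))
```

In this toy panel one person leaves employment and another enters it, so a pair of cross-sectional counts would show no change in employment at all.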

Length of spells in poverty is of increasing interest. For 
example, for persons with low income, how long does one 
remain in that situation? What are the various factors that 
will determine whether this is a long-term situation? How 
important are education and other factors with respect to 
poverty and the length of poverty spells? 
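
Spell lengths of this kind can be derived from repeated observations of the same person. The following is a minimal sketch, assuming one low-income indicator per wave for a single individual. The function name and the data are hypothetical, and spells already in progress at the first wave or still in progress at the last wave are censored, which the sketch does not correct for.

```python
from itertools import groupby


def spell_lengths(indicators):
    """Lengths of consecutive runs of True in a sequence of per-wave indicators."""
    return [sum(1 for _ in run) for in_state, run in groupby(indicators) if in_state]


# Hypothetical low-income indicators for one person over seven waves.
low_income = [False, True, True, True, False, True, False]
print(spell_lengths(low_income))
```

A single cross-sectional survey would record only whether the person has a low income at that time, not how long the situation has lasted.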

In the field of education, an interesting aspect is the 
school-to-work transition at the time when people finish 
full-time school and decide to join the labour force. This 
behaviour may be measured more easily through a 
longitudinal study than through other types of surveys. 
Another education-related example is the effectiveness of 
various types of education such as vocational training and 
adult training programs. 

In justice and victimization, there are many examples 
where observing the same individuals over time can be 
beneficial. Persons who have been victimized could be 
followed up to assess the long-term implications. As well, 
persons who have been involved with the judicial system 
may be observed over time to determine the subsequent 
patterns of behaviour and the determinants for these 
patterns. 

Studies of consumer behaviour are of great interest to 
marketers and others. This would include purchasing 
patterns for consumers. Event histories for consumer 
purchasing would be very useful to many researchers. 

Studies on the effects of government transfer payments 
to individuals over time can be important to policy makers. 
A longitudinal study can determine how long individuals 
may be dependent on such government payments, whether 
or not habits are created because of the existence of some 
of these payments, what are the characteristics of the 
individuals and what are the long-term effects of partici- 
pation in various assistance programs. 

On the economic side, the longitudinal characteristics of 
various businesses are of great interest. One can measure 
how efficient these businesses are, what the use of 
technology is in these businesses, what is the long-term 
effect of this use and how productivity is changing over 
time. Various interesting questions on business demo- 
graphics could be asked; for example, what are the 
characteristics of businesses that result in failure, what are 
the economic conditions under which businesses are 
created. As well, mergers and amalgamations are of 
interest with respect to the conditions under which these 
occur. Through longitudinal studies these phenomena can 
be more easily measured. 

There have been various structural changes to many 
businesses over the last few years and it is only through 
longitudinal studies that one can observe some of these 
structural changes at the micro level. Many measures can 
be estimated only when the respondents are measured on 
more than one occasion. 

Another area of interest is in agriculture, where the 
nature of farming is undergoing transition. Of interest is 
how farms are changing, both in terms of the products that 
are being produced and the size of the farms. Changes in 
the characteristics of who is running the operation are also 
of interest. 

As we have discussed, there are many applications and 
many facets to longitudinal studies. Also, there are many 
dimensions to their design and analysis. In the following 
sections we summarize these issues around Four Questions: 
design issues, implementation issues, evaluation issues and 
analysis issues. Many of these issues have been discussed 
in Kasprzyk, Duncan, Kalton and Singh (1989) and in 
Armstrong, Darcovich and Lavallée (1993). Some design 
issues and time series methods are reviewed in Binder and 
Hidiroglou (1988). We include a few more recent 
references. 


QUESTION 1: DESIGN ISSUES 


When designing a longitudinal study, advance planning 
is vital to the success of the study. For example, one must 
ensure that only relevant and accurate information is being 
collected from the respondents so that the potential benefit 
of the longitudinal survey is maximized. This implies that 
the longitudinal analyses to be undertaken from the survey 
should be planned from the outset to ensure that the 
relevant data are obtained. Duncan and Kalton (1987) give 
an excellent summary of many of the issues. Webber 
(1994) describes the testing strategy used in the planning of 
the Survey on Labour and Income Dynamics. Huggins and 
Fischer (1994) discuss the plans for the redesign of the 
Survey of Income and Program Participation based on their 
experiences. Longitudinal studies can be more expensive 
than a series of cross-sectional studies. Therefore, the 
benefits of collecting these data must be even greater since 
the costs themselves are higher. As well, ensuring that 
funding for a longitudinal study can be assured is important 
since the fruits from the longitudinal nature of the study 
may not be borne until at least the second or third wave of 
the study. There is a difference, of course, between 
planning for a study to be longitudinal from the beginning 
as opposed to taking a series of cross-sectional data and 
trying to merge them into a longitudinal database. Obvi- 
ously, the former is more desirable but often, because of the 
history of the survey-taking organization, a series of cross- 
sectional data already exists so that merging these would be 
a reasonable alternative; see Hughes and Hinkins (1995). 

In general, careful attention needs to be paid to the 
design of the database for any longitudinal survey where the 
analysis includes longitudinal measures such as the study of 
episodes and spells. For some statistical agencies and 
organizations, the survey program is now in transition from 
cross-sectional surveys to longitudinal surveys. The change 
from a series of cross-sectional surveys to longitudinal 
surveys requires careful planning. When conducting 
longitudinal surveys, the databases need to be maintained 
and updated in ways that are very different from cross- 
sectional surveys. There may be many infrastructure and 
organizational issues within the agency that become 
apparent as more longitudinal surveys are being conducted, 
particularly with respect to the maintenance of the data- 
bases and the survey operations. The impact of such 
changes on the statistical organization may be substantial. 

An important issue to consider when planning for a 
longitudinal survey is whether or not the users will also be 
requiring cross-sectional estimates. Is there a requirement 
to have information about the respondents who are in the 
survey over a period of time, and also to be able to produce 
estimates for a single point in time as if it were a cross- 
sectional survey? If this is the case, there are major 
implications on the way the survey is designed and 
implemented; see Lavallée (1995). This concern would 
also be present if the variables of interest include comparing 
cross-sectional estimates over time, as opposed to true 
longitudinal measures such as studying autocorrelations for 
common units in a business survey. 

Concepts and the definitions used in longitudinal surveys 
are usually obtained through consultations with the data 
users. Even the definition of the longitudinal unit to be 
observed over time may need clarification for dynamic 
populations. This is the case for both household surveys 
and for business surveys. Understanding the user re- 
quirements and discussing what can be measured over time 
with appropriate quality is important. During the survey 
planning, these requirements must be carefully weighed 
against what is operationally feasible in an actual survey 
context. Given the eventual costs of these studies, 
conducting thorough tests is often worthwhile, particularly 
on the survey questionnaires. A point that deserves more 
attention is the need for more standard longitudinal 
measures that are common across countries. This would 
permit governments and researchers to make better 
international comparisons. 

Another major component for designing longitudinal 
studies is the creation, use and maintenance of sampling 
frames over time in ways that facilitate the implementation 
of the study. For example, an establishment panel survey 
may be based on a business register that can be highly 
dynamic with respect to births, deaths, mergers and 
amalgamations. It is important that the definitions of which 
units are to be included in these panels over time are clear 
under these conditions. 

One reason that longitudinal surveys have become more 
prevalent in recent years is the fact that there are more 
administrative data files available now that can be used as 
frames for conducting the longitudinal studies. The 
administrative files themselves may also contain useful 
information, besides serving as frames per se. 
Some data manipulation of the administrative data is 
usually required to make these data useful for the statistical 
purpose of the longitudinal study, however. In general, the 
impact of frame changes to the study must be carefully 
considered at the design stages. 

A common practice is to take a number of different 
administrative files and to match them to create a sampling 
frame. As well, some longitudinal studies are based solely 
on the information contained in various administrative files. 
The difficulty, of course, is that over time these administra- 
tive files will change. This may imply a change to the 
samples that are being taken from these files, and therefore 
special measures will need to be taken to keep the analyses 
relevant. 

Often a longitudinal study is based on an existing survey 
or census conducted at a point in time in the past, and this 
then becomes the basis for the sampling frame for following 
up respondents over time. One disadvantage of this is that 
it becomes difficult to obtain cross-sectional estimates when 
births to the population are excluded from the frame. 
Record linkage techniques may be necessary for main- 
taining the frame and such techniques are usually error- 
prone. 

For rare populations, it is often advantageous to use not 
just a single frame but to use multiple frame methods. This 
ensures that there is adequate representation from the 
populations of interest that might be underrepresented in a 
single frame, but this may also require the use of record 
linkage and complex weighting techniques. 

An important design issue is the method of sampling 
from the frame once it has been established. In Kalton and 
Citro (1993), a number of different types of longitudinal 
surveys were enumerated. These were repeated surveys, 
that is, a series of cross-sectional surveys; panel surveys, 
where certain respondents are selected and followed up 
over time; repeated panel surveys, where new panel surveys 
are selected at different points in time; rotating panel 
surveys, where on each occasion a panel is dropped from 
the study and a new panel is added; overlapping surveys, 
where there are common respondents from one occasion to 
the other, but not necessarily through a fixed panel sample 
design; split panel surveys that can be a combination of 
panel surveys and repeated or rotating panel surveys. The 
sample design must ensure that there is a sufficient sample 
from the population of interest as well as from any of the 
control groups. Administrative data have proven to be very 
useful when designing a sample for many of these surveys 
as they often provide a suitable frame. 
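
A rotating panel design of the kind listed above can be pictured as a schedule of which panels are in the sample on each occasion. The sketch below assumes the simplest case, in which exactly one panel is dropped and one new panel is added at every occasion and each occasion has the same number of panels in the sample. The function names and the panel numbering are illustrative only.

```python
def rotating_panel_schedule(n_waves, panels_in_sample):
    """Return, for each wave, the list of panel numbers in the sample.

    Wave 1 uses panels 1..panels_in_sample; at each later wave the oldest
    panel is dropped and a new, higher-numbered panel is added.
    """
    return [list(range(w, w + panels_in_sample)) for w in range(1, n_waves + 1)]


def overlap(schedule, wave_a, wave_b):
    """Number of panels common to two waves (waves are numbered from 1)."""
    return len(set(schedule[wave_a - 1]) & set(schedule[wave_b - 1]))


schedule = rotating_panel_schedule(n_waves=5, panels_in_sample=3)
for wave, panels in enumerate(schedule, start=1):
    print(f"wave {wave}: panels {panels}")
print("panels common to waves 1 and 2:", overlap(schedule, 1, 2))
print("panels common to waves 1 and 4:", overlap(schedule, 1, 4))
```

The overlap between occasions determines which longitudinal comparisons are possible: adjacent waves share most panels, while waves far enough apart share none.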

As a referee pointed out, a key issue at the design stage 
is the strategy for dealing with sample loss through attrition, 
due to nonresponse, leaving the target population, etc. 
Possibilities include topping up the sample in subsequent 
waves, but such a strategy can distort the representativity of 
the cohort. Another strategy would be to start with a larger 
sample and not replace lost units; see, for example Singh, 
Petroni and Allen (1994). 
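
The second strategy implies inflating the initial sample. As a rough sketch, assuming a constant wave-to-wave retention rate (a simplification that is not taken from the references cited, since attrition in practice usually varies by wave and by subgroup), the required starting size can be computed as follows. The function name and the figures are hypothetical.

```python
import math


def initial_sample_size(target_final, retention_rate, n_waves):
    """Starting sample needed so that about `target_final` units remain at the last wave.

    Assumes the same share of units, `retention_rate`, is retained between
    each pair of consecutive waves and that lost units are not replaced.
    """
    if not 0 < retention_rate <= 1:
        raise ValueError("retention_rate must be in (0, 1]")
    if n_waves < 1:
        raise ValueError("n_waves must be at least 1")
    surviving_share = retention_rate ** (n_waves - 1)
    return math.ceil(target_final / surviving_share)


# Hypothetical example: 1,000 units wanted at wave 4, 90% retained per wave.
print(initial_sample_size(1000, 0.9, 4))
```

Such a calculation addresses only the size of the remaining sample, not the bias that arises when those who drop out differ from those who stay.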

When deciding on a particular sample design, considera- 
tion must be given to the related weighting and estimation 
issues. As well, the periodicity or frequency of the survey 
must be established. Obviously, when the variables of 
interest change more rapidly, having the survey conducted 
more frequently would be more desirable. On the one hand, 
more frequent surveys lead to increased cost and respondent 
burden; on the other hand, less frequent surveys can lead to 
larger recall biases. These cost-quality tradeoffs are usually 
difficult to quantify. 

Very often, if both cross-sectional and longitudinal 
estimates are required, it may be necessary to select 
supplementary samples to ensure that there will be valid 
cross-sectional estimates. This is because there may be members of 
the population in the cross-sectional estimates who were not 
in the sampling frame on previous waves and, therefore, 
would not be represented in the sample. Czajka (1994) 
studies this for the case of estimating income. 

Designing some evaluation samples is also worthwhile 
at the planning stage. There are a number of sources of bias 
in longitudinal surveys. Some of these biases can occur 
simply because the same respondent has been surveyed on 
a number of occasions. Therefore, consideration should be 
given to adding additional samples for evaluation purposes 
only, in order to be able to measure some of these impacts. 
These samples would include individuals in the target 
population that were not in the longitudinal survey. They 
are most useful for evaluating cross-sectional measures. 


QUESTION 2: IMPLEMENTATION ISSUES 


The second main issue we discuss is related to the 
implementation of a longitudinal study. First, one has 
various choices of modes of data collection. Recently, 
computer-assisted interviewing has gained popularity. With 
computer-assisted interviewing, more choices of survey 
instruments are available. For example, using dependent 
interviewing where the respondent or the interviewer has 
access to the responses from previous occasions is easier. 
This may increase or decrease certain biases. Hill (1994) 
assesses this in the context of the Survey of Income and Program 
Participation. 

Of course, since we are going back to the same 
respondents on a number of occasions, the question of 
response burden is even more crucial than in a single cross- 
sectional survey. We do not want to overload the res- 
pondent since this could result in higher refusal rates at later 
waves of the survey. Michaud, Dolson, Adams and Renaud 
(1995) suggest respondent burden can be reduced by 
making more use of administrative data. Reducing attrition 
due to nonresponse is an important goal in longitudinal 
surveys and consideration may be given to the use of 
monetary or other incentives to help keep the integrity of 
the sample over time; see Lengacher, Sullivan, Couper and 
Groves (1995). Another means of reducing attrition is to 
collect information to aid in the tracing efforts and to keep 
in contact with the respondents over time; McGuigan, 
Ellickson, Hays and Bell (1995) study the alternatives of 
tracing, reweighting and sample selection modelling to 
cope with attrition problems. 

In some longitudinal surveys, some data are collected 
retrospectively; that is, questions are asked which refer to 
previous points in time as well as the current point in time. 
This could lead to what is known as seam effects. As a 
result, the observed changes over the reference periods may 
depend on which periods contain data obtained retro- 
spectively. 

Administrative records may be useful to enrich the 
database so that not all data need to be collected directly 
from the respondent; see Michaud et al. (1995). Of course, 
this could depend on the quality of the administrative data, 
its availability, and what the interplay is between the 
information from the administrative records and the survey 
variables; see Stearns, Kovar, Hayes and Koch (1996) for 
an example that studies this relationship. When dealing 
with administrative data or merged sample files, there may 
be data gaps in these various files and how to handle these 
data gaps becomes an issue. 

In general, changes to the frame structure can result in 
difficulties when performing the longitudinal analyses. 
Some key characteristics of the respondents could also be 
changing over time. For example, in a business register, 
the industrial classification of a business may change 
because the nature of its products changes over time. 
Keeping track of this changing classification on the 
database is important to ensure that the longitudinal 
analyses are as useful as possible. This can also 
complicate the analysis. 

Many issues arise when the database is obtained by 
combining the samples from a series of individual surveys. 
Integrating this information may present a challenge 
because different surveys may have used different method- 
ologies. This could result in some inconsistencies in the 
quality of the information from one database to another. 

Important issues for many longitudinal surveys are those 
related to record linkage. Record linkage is used in many 
processing steps. In some cases, the longitudinal studies 
may be based solely on these linked files. Record linkage 
is common for creating and maintaining the survey frames, 
including linking administrative files over time, linking 
administrative files and survey frames and linking separate 
survey frames. For example, for surveys of establishments, 
we may wish to create longitudinal composite records for 
the establishments that are based on several independent 
repeated surveys, since many of the establishments are 
surveyed on each occasion. Record linkage is often used to 
find which units correspond to the same establishments. 
Record linkage is also used to identify births to a frame. Of 
course, the errors due to the record linkage can be important 
in the analysis; see Scheuren and Winkler (1993). 

In some cases, in fact, no real respondents are being 
followed over time. Instead record linkage is used to create 
artificial populations through statistical matching. These 
populations are then analysed as if they were real. 

Another implementation issue is that of handling non- 
response. It is known that nonresponse to longitudinal 
surveys does not occur completely at random. There tends 
to be differential nonresponse among different 
subpopulations. Therefore, special attention needs to be paid to 
how the imputations or reweighting will be performed; see, 
for example, Tambay, Schiopu-Kratina, Mayda, Stukel and 
Nadon (1998). When using administrative data as the basis 
for the longitudinal study, there may be missing admin- 
istrative data and special procedures will be necessary to 
handle this situation. 

For missing data, there are generally two methods of 
treatment: imputation and reweighting. Reweighting is 
common for situations where there is wave nonresponse. 
Imputation is more frequently used when there is partial 
nonresponse within a given wave of the survey. There can 
be advantages to longitudinal imputation as opposed to 
cross-sectional. For longitudinal imputation, the longitu- 
dinal information from the same individual on the database 
is used as the basis for doing the imputation, as opposed to 
using other individuals at the same point in time. For 
attrition and wave nonresponse, one may wish to model the 
attrition rates and use these models to compensate for the 
nonresponse through weight adjustments. A variety of 
weight adjustments were researched for the Survey of 
Income and Program Participation and the results were 
presented in Rizzo, Kalton and Brick (1994), Folsom and 
Witt (1994), and An, Breidt and Fuller (1994). Singh, Wu 
and Boyer (1995) study this problem for the difficult case 
of estimating gross flows. 
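
The weight adjustment for wave nonresponse described above can be sketched in a few lines. This is only a minimal illustration, not the method of any of the papers cited: it assumes that attrition is modelled by the weighted response rate within adjustment cells, and the function name `adjust_weights_for_attrition`, the record fields and the sample values are all hypothetical.

```python
from collections import defaultdict

def adjust_weights_for_attrition(records):
    """Return respondents only, with weights inflated by the inverse of the
    weighted response rate in their adjustment cell (an assumed model)."""
    total = defaultdict(float)      # weighted count of all sampled units per cell
    responding = defaultdict(float) # weighted count of respondents per cell
    for rec in records:
        total[rec["cell"]] += rec["weight"]
        if rec["responded"]:
            responding[rec["cell"]] += rec["weight"]

    adjusted = []
    for rec in records:
        if rec["responded"]:
            rate = responding[rec["cell"]] / total[rec["cell"]]
            adjusted.append({**rec, "weight": rec["weight"] / rate})
    return adjusted

# Hypothetical wave-2 sample: cell membership and response status are made up.
sample = [
    {"id": 1, "cell": "renter", "weight": 100.0, "responded": True},
    {"id": 2, "cell": "renter", "weight": 100.0, "responded": False},
    {"id": 3, "cell": "owner", "weight": 150.0, "responded": True},
    {"id": 4, "cell": "owner", "weight": 150.0, "responded": True},
    {"id": 5, "cell": "owner", "weight": 150.0, "responded": False},
]

adjusted = adjust_weights_for_attrition(sample)
for rec in adjusted:
    print(rec["id"], rec["cell"], round(rec["weight"], 2))
```

Within each cell the adjusted weights of the respondents add up to the original weighted total of the cell, so the adjustment removes bias only to the extent that respondents and nonrespondents in the same cell are alike.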

There are many complexities that may be introduced into 
the derivation of the weights. There are various approaches 
and techniques available to calculate both cross-sectional 
weights and longitudinal weights. Cross-sectional weights 
are used for measures of the population at a single point in 
time, whereas the longitudinal weights are necessary when 
data from individuals over more than one occasion are 
included. The analyst may wish to have person-level 
weights that are different from the household-level weights; 
see Kalton and Brick (1995). For example, for some variables 
such as household income, using household-level weights 
would be preferable to the individual person-level weights. 
Weighting becomes more complex with the use of multiple 
frames. Effective use of administrative data may imply 
even more complexities in the weighting scheme itself; see, 
for example, Stearns et al. (1996). 

There are many causes for the samples to become 
unrepresentative. For example, lack of representativity 
could be due to problems of coverage due to immigration 
into the population. Some undercoverage may be due to 
attrition. Some overcoverage could be due to including 
some non-sampled co-habitants of a household, thus 
implying that those individuals could be included in the 
sample by living with an originally sampled person; see 
Lavallée (1995) and Kalton and Brick (1995). Other types 
of systemic overcoverage are also possible. Ensuring that 
no biases are introduced requires special weighting 
treatments. For longitudinal surveys in particular, this may 
become quite complex. Administrative data can be used 
both to assess whether or not the sample is representative 
and to provide information for making the appropriate 
adjustments. 

Since much of the estimation for a longitudinal study will 
be associated with measuring change as opposed to 
measuring the phenomena at a single point in time, there 
will be questions about how to develop the variances for 
these estimates of measures of change. Some new 
procedures may need to be developed for this situation. In 
general, variance estimates can become quite complicated 
when the statistics are complex functions of the longitudinal 
observations. For example, income class boundaries may 
change over time and studying the transitions of individuals 
from one class to another is of interest. 

Another complexity of estimation may be the desire to 
include information from ongoing cross-sectional surveys 
to produce new integrated measures, using all the 
information that is available from the various sources. 


QUESTION 3: EVALUATION ISSUES 


The third set of issues we discuss is related to the 
evaluation of the information and methods. Even though 
the evaluations may be conducted separately from the 
implementation, the results of such evaluations should 
impact on the survey itself, either by altering the estimation 
methods or by changing the way the survey is designed and 
implemented in future waves. 

There are many sources of biases that could be studied. 
Biases may be due to dependent interviewing by giving the 
respondent and the interviewer information that could refer 
to a previous occasion of the survey. Seam effects can arise 
from retrospective studies; see, for example, Murray, 
Michaud, Egan and Lemaitre (1991). Other sources of bias 
could occur when the nonresponse is informative; that is, 
when the nonresponse propensity is related to the variable 
of interest. An example would be when household-level 
nonresponse is correlated with gross flows within the 
household, where gross flows are the changes in the 
individual’s classification; see Clarke and Chambers (1989). 
Other biases could be due to measurement or classification 
errors; see, for example, Bassi, Torelli and Trivellato 
(1998). Conditioning bias could arise because respondents 
who are asked repeatedly about topics such as labour 
dynamics may become more sensitized to these issues, so 
that their behaviour changes as a result of being 
included in the survey. 

The effect of response errors and interviewer errors on 
the analysis should be evaluated. Different individual 
interviewer methods may lead to different error rates. The 
stability or instability of the turnover of interviewing staff 
could affect some analyses. Questions such as whether or 
not the information was collected by proxy can also be 
relevant. 

Other evaluations could be performed to measure the 
effect of attrition and to evaluate various imputation 
methodologies and other nonresponse handling strategies; 
see Tin (1996) for an evaluation of attrition using econo- 
metric methods. Schejbal and Lavrakas (1995) study the 
effect of panel attrition in a dual-frame local telephone 
survey. Corder, Manton and Woodbury (1994) study ways 
to improve coverage and reduce attrition in the context of 
the National Long Term Care Survey. Panel attrition could 
be the result of non-traceable or refusal cases, the impact of 
which can be quite different from cross-sectional surveys, 
and these differences should be studied. Allen and Petroni 
(1994) discuss the problem of adjusting for movers. 

There is a need to develop quality studies that take into 
account the special features of longitudinal surveys. Many 
quality control studies are available in the conduct of 
longitudinal surveys besides the usual ones for cross- 
sectional surveys, since the repeated nature of the study can 
lead to a more efficient identification of error-prone cases. 
Since for longitudinal studies, the stability of the data over 
time is an issue, methodological changes in the study could 
have an impact on the longitudinal measures that are of 
interest and these should be evaluated. Administrative data 
can provide useful evaluations since some of the data can 
help validate some of the results. 


QUESTION 4: ANALYSIS ISSUES 


Analysis concerns are the last set of issues we discuss. 
It is the potential analysis of the longitudinal study that is its 
most important facet. The causes or determinants of 
various outcomes are of major interest to the data users. 
However, the modelling of these causes can be complex, 
particularly if the survey itself is of a complex nature. 
Many of these issues are discussed in Singh and Whitridge 
(1990) and in Hidiroglou and Michaud (1998). 

Examples of the kinds of analyses that are common 
would be measures of gross flows or other measures of 
gross change. A gross flow refers to the movement of 
individuals from one category to another. In other words, it 
is the flow from category A to category B between two 
points in time, as opposed to the net flow, which is the 
change in the margins over time. There are difficult questions about 
the impact of measurement error on the measurement of 
gross flows. If fairly large measurement errors are present 
on each occasion, there will be a significant impact on the 
bias of the estimates of the gross flows, even if the net flows 
themselves are not as adversely affected. Sometimes, 
sample rotation will aggravate this problem, since 
accounting for sample rotation properly when measuring 
gross flows can be problematic. Special treatment is 
needed for those panels that are entering the sample on a 
given occasion and for those panels that have left the 
sample on the previous occasion to get good estimates of 
these flows. The changes to the population when gross 
flows are being measured need to be sorted out from the 
gross flows themselves. In other words, the change from 
one occasion to another is a combination of the changes in 
size of the population and the individual changes within the 
population. The situation can become even more complex 
when the gross flows are themselves analysed with respect 
to other information such as income dynamics. 
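
The contrast between gross and net flows under measurement error can be made concrete with a small numerical sketch. The figures below are invented for illustration, and the error mechanism (the same misclassification rates applied independently on each occasion) is an assumption rather than a description of any particular survey.

```python
# Hypothetical true joint distribution over three labour force states:
# true_flows[k][l] = P(state k at time 1, state l at time 2).
true_flows = [
    [0.57, 0.01, 0.02],
    [0.02, 0.03, 0.01],
    [0.01, 0.01, 0.32],
]

# Hypothetical error rates: q[k][i] = P(observed state i | true state k),
# assumed to apply independently on each occasion.
q = [
    [0.98, 0.01, 0.01],
    [0.05, 0.90, 0.05],
    [0.02, 0.02, 0.96],
]

def observe(flows, q):
    """Observed joint distribution when both occasions are misclassified independently."""
    r = len(flows)
    return [[sum(flows[k][l] * q[k][i] * q[l][j]
                 for k in range(r) for l in range(r))
             for j in range(r)] for i in range(r)]

def share_of_movers(flows):
    """Total gross flow between different states (off-diagonal mass)."""
    r = len(flows)
    return sum(flows[i][j] for i in range(r) for j in range(r) if i != j)

def net_changes(flows):
    """Change in each state's margin: time-2 share minus time-1 share."""
    r = len(flows)
    return [sum(flows[k][i] for k in range(r)) - sum(flows[i][l] for l in range(r))
            for i in range(r)]

observed_flows = observe(true_flows, q)
print("share of movers, true:    ", round(share_of_movers(true_flows), 4))
print("share of movers, observed:", round(share_of_movers(observed_flows), 4))
print("net changes, true:    ", [round(x, 4) for x in net_changes(true_flows)])
print("net changes, observed:", [round(x, 4) for x in net_changes(observed_flows)])
```

With these made-up numbers the observed share of movers is noticeably larger than the true share, while the net changes in the margins move only slightly, which is the pattern described in the text.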

As a referee pointed out, an important issue is the need 
for educating users on how longitudinal data can be 
analysed effectively. The recent increase in the number of 
longitudinal surveys raises many opportunities for new 
types of analysis, but many analysts who have been 
studying only cross-sectional surveys may not be aware of 
the most appropriate techniques. 

For the many surveys that use frames based on admin- 
istrative data, accounting for the frame changes in the 
analysis may be necessary, since inclusion on the frame can 
be subject to changes in administrative procedures, as well 
as changing conditions for the individuals. For example, a 
file of unemployment insurance beneficiaries would be 
subject to changing eligibility criteria, as well as changing 
personal situations. 

The measurement of change can often be decomposed 
into various components. For example, the movement of 
units in the sample from one domain to another can be 
sorted out from the changes of the data for units within the 
same domain. Holt and Skinner (1989) contains an 
interesting discussion on various components of change. 

For more complex analyses, such as modelling of time 
series, most classical time series models do not account for 
the fact that the information is derived from a sample 
survey. Therefore, the sampling errors resulting from the 
sample survey are not properly taken into account in the 
time series modelling. 

In the analysis, some measures may depend on other 
cross-sectional surveys. For example, it may be another 
cross-sectional survey that determines the income class 
boundaries to be used in the analysis of the longitudinal 
survey. This may add to the complexity of the analysis 
since the boundaries can change over time. 

The question of whether and how to use the sampling weights has 
created difficulties for many analysts, since many of the 
classical models for analysis of data over time do not use 
the sampling weights. Procedures need to be developed 
that incorporate the survey weights in the analysis properly. 
For large-scale surveys, using the weights is often prefer- 
able as this provides some protection against model 
misspecification. 

Errors resulting from the processing, such as the record 
linkage operation, may need to be incorporated in the 
analysis, or at least some studies need to be undertaken to 
understand the impact of these kinds of errors; see, for 
example, Dorinski and Huang (1994). 

Often administrative data are used as part of the analysis 
since these data may be available more readily than collect- 
ed information. However, since there may be conceptual or 
other difficulties with the administrative data, special 
analytical methods may need to be developed to use the 
administrative data effectively. 

Finally, we mention the difficulties associated with the 
data dissemination. Longitudinal summary measures need 
to be developed for many phenomena. Often these are not 
suitable for the usual tabular displays that are commonly 
used in cross-sectional studies. Many analyses require 
access to the microdata. This could create problems with 
respect to protecting the confidentiality of the respondents. 
The usual measures that one takes when releasing micro- 
data files on cross-sectional surveys may not be sufficient 
when releasing surveys which are longitudinal in nature, 
because the databases are so much richer that the risk of 
identifying an individual on such databases 
becomes much greater. Protecting the respondents’ 
confidentiality is of paramount importance, so a conserva- 
tive approach that may not fulfill all the users’ requirements 
may be necessary. 


SUMMARY 


We have briefly discussed many of the questions and 
issues that are now being investigated by researchers 
concerned with the design and analysis of longitudinal 
studies. Based on our discussion, we see that many 
questions need to be further investigated. As we gain more 
experience with longitudinal surveys, many of these issues 
will be better understood and many new issues will arise. 
The opportunities for important research and investigation 
are numerous. 


Data and Modelling Strategies in Estimating Labour Force Gross Flows 
Affected by Classification Errors 


FRANCESCA BASSI, NICOLA TORELLI and UGO TRIVELLATO¹ 


ABSTRACT 


Gross flows among labour force states are of great importance in understanding labour market dynamics. Observed flows 
are typically subject to classification errors, which may induce serious bias. In this paper, some of the most common 
strategies used to collect longitudinal information about labour force condition are reviewed, jointly with the modelling 
approaches developed to correct gross flows when affected by classification errors. A general framework for estimating 
gross flows is outlined. Examples are given of different model specifications, applied to data collected with different 
strategies. Specifically, two cases are considered, i.e., gross flows from (i) the U.S. Survey of Income and Program 
Participation and (ii) the French Labour Force Survey, a yearly survey collecting retrospective monthly information. 


KEY WORDS: Correlated classification errors; Latent class models; Longitudinal data; Recall errors; Seam effect. 


1. INTRODUCTION 


Gross flows among labour force states are a powerful 
tool to analyse labour market dynamics. Gross flows concern 
changes at the individual level, and therefore their estimation 
rests on the availability of longitudinal data. 

Erroneous classification of units with 
respect to their position in the labour market can cause 
spurious transitions. Even if one might assume that these 
errors cancel out when estimating net flows, they cannot be 
ignored when estimating gross flows. 

Various strategies can be adopted in order to correct 
gross flows for classification errors. Basically, they depend 
on: 


(a) assumptions about the classification error 
mechanism, following from 

(a1) the survey design (panel surveys, possibly with 
a rotating scheme; retrospective surveys; some 
mixture of retrospective and panel surveys; etc.), 
and/or 

(a2) the content and structure of the questionnaire 
(availability of one or more indicators of the 
variable of interest; format of the questions, 
whether episode based or event based; etc.); 

(b) assumptions about the generating process of the 
transitions among labour force states. 


In this paper, some of the most common strategies used 
to collect longitudinal information about labour force 
condition are reviewed, jointly with modelling approaches 
developed to correct gross flows when affected by 
classification errors. It is shown that most of the usual 
specifications proposed in the literature can be seen as special cases 
of a general formulation, which helps to elucidate the 
advantages and disadvantages of each specification and makes it 
possible to consider a common estimation strategy. 


The focus of the paper is on sound applications of this 
general modelling approach for estimating gross flows 
from survey data collected with different strategies. Two 
cases are considered: (i) the U.S. Survey of Income and 
Program Participation and (ii) the French Labour Force 
Survey, a yearly rotating panel survey with retrospective 
monthly information. 

The organization of the paper is as follows. Section 2 
briefly discusses various strategies for collecting longitu- 
dinal data on labour force participation, and their likely 
implications for classification errors, as they emerge from 
the survey methodology literature. In section 3, a fairly 
general approach for modelling gross flows affected by 
classification errors, i.e., for jointly estimating true gross 
flows and conditional response probabilities, is outlined. 
Examples are also given of how some well-known models 
for correcting observed gross flows can be specified as 
special cases of this approach (section 3.1). Attention is 
then devoted to a convenient framework for formulating the 
above models, provided by latent class models and, more 
specifically, by the so-called “modified LISREL model” 
proposed by Hagenaars (1990), a general tool to describe 
causal relationships among observed and unobserved 
categorical variables (section 3.2). 

The final and main part of the paper (section 4) is 
devoted to a detailed presentation of the two case-studies. 
The modelling approach is common: a priori information 
on the measurement characteristics of the survey (and 
possibly on the true process) is combined with 
specification searches in order to obtain parsimonious and 
(hopefully) sensible models. As already noted, the two 
case-studies are reasonably different, chiefly in terms of the 
design of the surveys: this diversification turns out to be 
useful for illustrating different model specifications, and 
various strategies for reaching/testing the final formulation. 

From the two case-studies, the following overall 
evidence can be drawn: 


(a) the modified LISREL model has proved to be a 
set-up flexible enough for modelling the error 
mechanism in longitudinal data collected with different 
survey designs, as well as the generating process of 
true labour force transitions; 

(b) specifically, in the measurement part of the model, 
we were able to incorporate the pattern and the 
effects of correlated classification errors, which are 
particularly important in surveys with retrospective 
features; 

(c) observed transitions are corrected in the 
direction expected on the basis of theoretical and 
empirical evidence on the effects of measurement errors 
(not mechanically towards mobility, as strategies 
based on the assumption of independent 
classification errors do). 


2. THE ROLE OF DATA COLLECTION 
STRATEGIES 


Information for labour gross flows estimation comes 
from longitudinal data, i.e., observations on the same units 
pertaining to different time points. Recently, there have 
been increasing efforts in collecting longitudinal data. This 
is also true for surveys whose main goal is to measure the 
labour force condition of individuals in a given population. 
On the other hand, this focus on collecting and using 
longitudinal data has raised new questions about the origin and 
pattern of measurement (= classification) errors, as well as 
their possible effects on estimates of the quantities of 
interest. General references about sources of classification 
errors for longitudinal data collected by surveys across 
time are Duncan and Kalton (1987) and Kalton and Citro 
(1993). In this section, some main implications of 
classification errors for modelling strategies to correct gross 
flows are briefly discussed. 

A typical argument about the effect of measurement 
error in estimating gross flows is that it leads to 
overestimation of changes. This is true when one assumes that 
measurement errors are not correlated over time. This 
assumption is not realistic in many cases (see Skinner and 
Torelli 1993; Singh and Rao 1995; van de Pol and 
Langeheine 1997), and should be reconsidered, taking 
carefully into account the data collection strategy actually 
adopted. Broadly speaking, if longitudinal data are (at least 
partly) collected by retrospective interrogation, one can 
argue that memory inaccuracy leads to correlated errors. 
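
As a minimal sketch of how retrospective reporting can produce correlated errors, suppose that a share c of respondents, unable to recall their earlier state, report their current state for the earlier reference period as well. This mechanism, the function names and the figures below are assumptions made purely for illustration; they are not taken from the studies cited.

```python
# Hypothetical true joint distribution over three labour force states:
# true_flows[k][l] = P(state k at time 1, state l at time 2).
true_flows = [
    [0.57, 0.01, 0.02],
    [0.02, 0.03, 0.01],
    [0.01, 0.01, 0.32],
]

def observe_with_recall_error(flows, c):
    """Observed flows when a share c of respondents report their time-2 state
    for time 1 as well, so that they appear as stayers."""
    r = len(flows)
    time2_margin = [sum(flows[k][j] for k in range(r)) for j in range(r)]
    return [[(1.0 - c) * flows[i][j] + (c * time2_margin[j] if i == j else 0.0)
             for j in range(r)] for i in range(r)]

def share_of_movers(flows):
    """Total gross flow between different states (off-diagonal mass)."""
    r = len(flows)
    return sum(flows[i][j] for i in range(r) for j in range(r) if i != j)

c = 0.25  # assumed share of respondents affected by the recall error
observed_flows = observe_with_recall_error(true_flows, c)
print("share of movers, true:    ", round(share_of_movers(true_flows), 4))
print("share of movers, observed:", round(share_of_movers(observed_flows), 4))
```

Under this assumed mechanism the error in the time-1 report depends on the true state at time 2, and the observed share of movers is (1 - c) times the true share: change is understated rather than overstated.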

Specific assumptions about classification errors can be 
successfully introduced in appropriate statistical models 
only if additional information is available in the form of 
plausible a priori knowledge about the error generating 
mechanism and/or supplementary data about the labour 
force state. 

Modelling strategies to correct gross flows for 
classification errors should then take into account the 
measurement process actually used, in the sense that the amount of 
classification errors and the direction of possible bias are 
related to the strategy adopted to collect longitudinal data. 

As is well known, longitudinal data can be obtained by 
different survey strategies. It is convenient to distinguish at 
least between (i) panel surveys and (ii) retrospective 
surveys. In addition, the availability of multiple indicators 
deserves specific attention. 

Panel surveys are the most natural way of collecting 
longitudinal information. Among these, rotating panel 
surveys play a prominent role. In fact, this is the scheme 
adopted in most national Labour Force Surveys (LFSs), 
whose primary goal is estimation of labour force stocks. For 
LFSs with a rotating sampling design, longitudinal 
information on the (usually short) sequence of states can be easily 
obtained by matching data on individuals participating in 
two or more successive surveys. In LFSs, the reference 
period, concepts and definitions for classifying people are 
typically consistent with the International Labour Office 
(ILO) recommendations (Hussmanns, Mehran and Verma 
1990): this makes measures of labour force conditions 
reasonably accurate and comparable over space and time. 
Data on labour force participation are also collected 
through general purpose household surveys. In this case, 
attention to labour force condition is less prominent than in 
the preceding type of surveys, and reference periods, 
concepts and definitions might be less consistent with ILO 
recommendations. 

Alternatively, longitudinal information can be collected 
by retrospective surveys. Cross-sectional surveys can 
include retrospective questions to get information on the 
sequence of labour force states experienced by sampled 
individuals. In this case, the interrogation strategy is crucial 
to reduce errors due to memory (recall errors, telescoping, 
etc.). Procedures to improve the accuracy of reporting in 
retrospective surveys rely upon contributions from cognitive 
psychology and survey methodology (for a review, see 
O’Muircheartaigh 1996). Besides, evidence on the amount 
and the direction of bias due to memory inaccuracy is 
found in many empirical studies. It is worth adding that, in 
retrospective surveys, factors related to length of recall 
period, salience of events considered, and/or difficulty in 
retrieving data on past events usually lead to a simplified 
format of questions, not consistent with ILO conventions on 
labour force condition. 

Interesting opportunities for estimating gross labour 
flows in the presence of classification errors come from the 
widespread practice of using a mixture of the panel and the 
retrospective strategies. Panel surveys use retrospective 
questions, at least on a limited number of topics, to cover 
the period between two successive waves (this is the case of 
the Survey of Income and Program Participation, as will be 
seen in section 4.2). The main characteristics of the 
measurement process when such a mixed strategy is used 
have to be carefully considered, as they might have a 
considerable impact on the formulation of reasonable models for 
classification errors. More specific traits of the measure- 
ment process emerge also from consideration of the pecu- 
liarities of the survey design. 

From a different perspective, an important opportunity 
for modelling classification errors is given by the avail- 
ability of multiple measurements of labour force state, i.e., 
data on the labour market condition of an individual at a 
given time, provided by two or more different sources. This 
information is of great importance in general, and par- 
ticularly when fairly complicated patterns of correlated 
classification errors are to be considered. Multiple indica- 
tors on labour force state can be collected (i) in the same 
interview or (ii) in different interviews (e.g., in different 
waves of a panel survey). 

The first case is not very common, but sometimes 
questions regarding labour force condition are asked in 
different contexts and in different ways. For instance, first, 
the individual is asked for a self-classification with respect 
to labour force condition; then, in a different section of the 
questionnaire, a sequence of questions is put forward that 
allows the respondent to be classified according to standard 
labour force definitions. (For a different example, see the 
case of the Survey of Income and Program Participation in 
section 4.2.) 

The second case covers several situations. At least two 
of them are worth considering: 


(a) data from reinterview studies, often collected 
specifically to get information on classification error 
probabilities (in such a case, the common practice is to 
treat reinterview data as validation data: for 
classical procedures to correct gross flows based on 
reinterview data, see Abowd and Zellner 1985, 
Poterba and Summers 1986, and Chua and Fuller 
1987); 

(b) data collected retrospectively in panel surveys, but 
referring to a time point already covered by the 
preceding interview, or collected in a supplementary 
survey carried out occasionally and covering the 
reference period(s) of the current panel survey. It is 
obvious that, in this case, different measures of the 
same variable(s) of interest can be contaminated by 
classification errors with largely different 
characteristics. 


Many of the points raised here will be clarified in the 
case-studies presented in section 4, where the joint presence 
of panel and retrospective information and of multiple 
indicators of the same latent variable is exploited in order 
to get parsimonious models. 


3. ESTIMATING GROSS FLOWS AFFECTED BY 
CLASSIFICATION ERRORS 


3.1 A General Framework 


Specification of statistical models to adjust labour force 
gross flows for classification errors should allow one to 
take into account the nature of available data (as reviewed 
in the previous section) and substantive assumptions on the 
generating process of (i) transitions among labour force 
states (e.g., Markov chain structures) and (ii) measurement 
errors (e.g., uncorrelated vs. correlated measurement errors). 

In the simplest case, we consider panel data, where at
each time period $t = 1, \ldots, T$, a discrete variable $Y_t$ is
observed for a generic unit in a random sample of size $n$. In
our case-studies, the units will be individuals, and the time
periods months or quarters. $Y_t$ takes one of $r$ possible
distinct values or states. $Y_t$ is an imperfect measure of $y_t$,
which denotes the true state of a generic unit at time $t$. In
general, it is not necessary to assume that $y_t$ varies over the
same set of states $1, 2, \ldots, r$, but for simplicity, and
without loss of generality, we will consider here the same set
of states as for $Y_t$.

Strategies for estimating gross flows rely upon an
appropriate specification of the joint probability of the true
and the observed process, $P(Y_1, \ldots, Y_T, y_1, \ldots, y_T)$.
Statistical analysis is then based on marginalization with
respect to unobserved quantities:

$$P(Y_1, \ldots, Y_T) = \sum_{y_1} \cdots \sum_{y_T} P(Y_1, \ldots, Y_T, y_1, \ldots, y_T). \qquad (3.1)$$

Models are based on parsimonious specifications of the
joint probability function $P(Y_1, \ldots, Y_T, y_1, \ldots, y_T)$.
Essentially, this can be obtained by decomposing it into a
product of conditional probabilities, following from an
appropriate set of assumptions about the dependence
structure among the components $Y_1, \ldots, Y_T, y_1, \ldots, y_T$.

For our purposes, a convenient starting point for model
specification comes from assumptions (i) about the
structure of the generating process of the true transitions
among labour force states and (ii) about the measurement
process (exploiting, for instance, substantive knowledge or
empirical evidence from the data collection strategy
adopted).

In a model aimed at distinguishing between true and
observed turnover in the labour market, a typical example
that exploits this idea is provided by Latent Class Markov
(LCM) models (van de Pol and Langeheine 1990). For a
generic unit, three sets of probabilities are specified: the
response probabilities (3.2), the transition probabilities
(3.3) and the initial probabilities (3.4).

Conditional probabilities (3.2) represent the relationship
between true and observed states, i.e., the probability of
reporting state $i_t$ at time $t$ while the true state is $j_t$.
Clearly, this specification implies the local independence
assumption, i.e., $Y_1, \ldots, Y_T$ are independent given
$y_1, \ldots, y_T$. Conditional probabilities (3.3) describe the
dynamics in the labour market, i.e., the probability that a
transition from $j_{t-1}$ to $j_t$ occurs when moving from time
$t-1$ to $t$: according to (3.3), the true transition process
follows a first order Markov chain. Finally, probabilities
(3.4) describe the initial condition for the Markov process.
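
Read off the verbal description above, and without claiming the authors' exact symbols, the three sets of probabilities can be sketched as follows. Here $i_t$ stands for the reported state and $j_t$ for the true state at time $t$; the layout and the labels in the right-hand column are assumptions made for illustration.

```latex
\begin{align*}
&P(Y_t = i_t \mid y_t = j_t), && t = 1, \ldots, T
  && \text{(response probabilities, cf. (3.2))} \\
&P(y_t = j_t \mid y_{t-1} = j_{t-1}), && t = 2, \ldots, T
  && \text{(transition probabilities, cf. (3.3))} \\
&P(y_1 = j_1) && && \text{(initial probabilities, cf. (3.4))}
\end{align*}
```

With $r$ states, each response or transition distribution sums to one over its first argument, so an unrestricted specification carries $r(r-1)$ free parameters per time point for each of the two conditional sets and $r-1$ for the initial distribution.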

The marginal probability for the observed sequence (3.1)
is then given by:

$$P(Y_1 = i_1, \ldots, Y_T = i_T) = \sum_{j_1} \cdots \sum_{j_T} P(y_1 = j_1) \prod_{t=2}^{T} P(y_t = j_t \mid y_{t-1} = j_{t-1}) \prod_{t=1}^{T} P(Y_t = i_t \mid y_t = j_t). \qquad (3.5)$$

For four measurement points, model (3.5) is equivalently
represented by the path diagram in Figure 1, where arrows
indicate direct effects between variables.

Figure 1. Path Diagram of a LCM Model for Four Measurement Points
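
As an illustration of how the marginal probability of an observed sequence can be evaluated under an LCM model, the sketch below sums over all latent paths and, alternatively, uses the standard forward recursion for hidden Markov chains. The parameter values are hypothetical, and time-homogeneous transition and response probabilities are assumed purely to keep the example short; none of this is taken from the authors' implementation.

```python
from itertools import product


def lcm_marginal_bruteforce(observed, initial, transition, response):
    """P(Y_1=i_1,...,Y_T=i_T), summing over every latent path (j_1,...,j_T)."""
    r = len(initial)
    total = 0.0
    for path in product(range(r), repeat=len(observed)):
        p = initial[path[0]]                      # initial probability
        for t in range(1, len(path)):
            p *= transition[path[t - 1]][path[t]]  # first order Markov step
        for t, i in enumerate(observed):
            p *= response[path[t]][i]              # local independence
        total += p
    return total


def lcm_marginal_forward(observed, initial, transition, response):
    """Same probability via the forward recursion: O(T r^2) instead of O(r^T)."""
    r = len(initial)
    alpha = [initial[j] * response[j][observed[0]] for j in range(r)]
    for i in observed[1:]:
        alpha = [
            sum(alpha[k] * transition[k][j] for k in range(r)) * response[j][i]
            for j in range(r)
        ]
    return sum(alpha)


# Hypothetical values for the states E, U, N (indices 0, 1, 2).
initial = [0.60, 0.08, 0.32]
transition = [[0.97, 0.02, 0.01],
              [0.20, 0.70, 0.10],
              [0.02, 0.02, 0.96]]
# response[j][i] = P(Y_t = i | y_t = j), assumed constant over time.
response = [[0.98, 0.01, 0.01],
            [0.05, 0.85, 0.10],
            [0.02, 0.03, 0.95]]

seq = (0, 0, 1, 0)  # observed E, E, U, E at four measurement points
p_brute = lcm_marginal_bruteforce(seq, initial, transition, response)
p_fwd = lcm_marginal_forward(seq, initial, transition, response)
```

Both functions return the same value; the brute-force version mirrors the sum over latent paths directly, while the forward version is the one that remains practical as the number of time points grows.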


It is worth observing that the assumption of local
independence is equivalent to the Independent Classification
Errors (ICE) assumption. As noted in the previous
section, the ICE assumption has been severely criticised,
and seems definitely unreasonable when longitudinal data
are collected by retrospective questions.

As another example, for $T = 2$, classical strategies to
correct gross flows based on reinterview studies can be
represented within the framework outlined above. In this
case, additional information is used, in the sense that the $q$
parameters are exogenously estimated from the reinterview
study and are plugged into (3.5) in order to obtain directly
$P(y_1, y_2)$.
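
A minimal sketch of such a plug-in correction for two time points is given below. It assumes the ICE assumption and a single response matrix `q`, with `q[j][i]` the probability of reporting state i when the true state is j, taken as known from a reinterview study. Under these assumptions the observed flow matrix equals q' · true · q, so the true flows follow by inverting that relation. All numbers and function names are hypothetical, and the matrix form is a generic illustration rather than a transcription of any of the cited procedures.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def mat_inv(a):
    """Gauss-Jordan inverse with partial pivoting; adequate for small matrices."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("response matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for row in range(n):
            if row != col:
                f = aug[row][col]
                aug[row] = [x - f * y for x, y in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def observed_flows(true_flows, q):
    """ICE: P(Y1=i, Y2=k) = sum over (j, l) of P(y1=j, y2=l) q[j][i] q[l][k]."""
    return mat_mul(mat_mul(transpose(q), true_flows), q)


def corrected_flows(obs_flows, q):
    """Invert the relation above: true = (q')^{-1} obs q^{-1}."""
    q_inv = mat_inv(q)
    return mat_mul(mat_mul(transpose(q_inv), obs_flows), q_inv)


# Hypothetical joint distribution of true states (rows: time 1, columns: time 2).
true_flows = [[0.58, 0.01, 0.01],
              [0.02, 0.05, 0.01],
              [0.01, 0.01, 0.30]]
# Hypothetical response probabilities, as if estimated from a reinterview study.
q = [[0.98, 0.01, 0.01],
     [0.05, 0.85, 0.10],
     [0.02, 0.03, 0.95]]

obs = observed_flows(true_flows, q)   # what the survey would show
recovered = corrected_flows(obs, q)   # plug-in correction
```

In practice `q` is itself estimated with error, and corrected cells can turn out negative when the observed table is sparse; the sketch ignores both issues.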

The same framework can be used to encompass more
general assumptions on both the latent and measurement
processes, up to and including serially correlated
classification errors. As an interesting case, we consider the
model by Pfeffermann, Skinner and Humphreys (1998).
Ignoring here initial conditions, they reformulate conditional
response probabilities as follows:

SPs sh \y, “Ip ee ail)

thus overcoming the ICE assumption.

A similar formulation, aimed at introducing, at least
partially, dependence between the observed state at time $t$
and the sequence of true states at times $t$ and $t-1$, has been
suggested by van de Pol and Langeheine (1992), who
extend the model to allow also for a second order Markov
chain for the true transition process.

The modelling strategy for estimating true flows can be 
further extended in various directions, namely: 

(a) It is straightforward to extend the model to exploit the
availability of multiple indicators of the same
unobserved true state. This implies that response
probabilities, like those in (3.2), are defined for one or
more additional observed variables, treated as imperfect
measures of the same latent state $y_t$. As an example, a
LCM model for two indicators per latent variable and
four points in time is represented in Figure 2. In this
model, each pair of indicators referring to a given
point in time is assumed to be independent, conditionally
on the corresponding latent variable, in the sense
that the correlation between them is completely
explained by their relation with $y_t$.

(b) Observed heterogeneity at the individual level, in the
transition and/or the measurement processes, can be
introduced by conditioning on a set of covariates $X$.
An example is given in Pfeffermann et al. (1998). They
use covariate information at the unit level and model
its impact on labour market condition by multinomial
logit models.

(c) Unobserved heterogeneity can also be considered,
which leads to mixed latent class models (van de Pol
and Langeheine 1990). A simple case is the movers/stayers
model, where a different behaviour, at the latent
level, is assumed for groups of units, while the group
membership of the units cannot be directly observed.

Figure 2. Path Diagram of a LCM Model for Four Measurement
Points and Two Indicators for Each Latent Variable


3.2 Latent Class and Related Models as a Tool for 
Estimating Gross Flows With Measurement 
Errors 


Latent class models are a special case of the general
model formulation outlined in the section above: the true
state in the labour market plays the role of the latent
variable, and the observed state acts as its indicator. Some
of the specifications outlined in the previous section
include dependence among classification errors. A general
and convenient approach for handling it, which includes
standard latent class models with correlated classification
errors, is the so-called modified LISREL model proposed by
Hagenaars (1990).

The modified LISREL approach consists of an extension 
of Goodman’s (1973) path analysis, which is a tool to 
describe causal relationships among observed categorical 
variables, through a system of logit equations. Basically, the 
extension incorporates latent variables. Thus, a modified 
LISREL model combines a measurement sub-model, which 
specifies the dependence of the indicators on latent 
variables, and a structural sub-model, which specifies 
ordered relations among latent and possible external 
variables. As the name itself suggests, it can also be viewed
as the analogue for discrete variables of the well-known
LISREL model for continuous variables (Jöreskog and
Sörbom 1988).

Modified LISREL models allow the introduction of serially
correlated classification errors by inserting direct effects
between the indicators (Hagenaars 1988). The presence of
direct effects implies that the association among observed
variables is not completely explained by the effects of the
latent variables on their indicators, but that there exists a
source of additional association among the indicators, over
and above the part that is explained by their relation with
the latent variables.

Once a reasonable model has been specified, identifi- 
cation should be ascertained. The model involves many 
unobservables, and identification of all parameters is not 
automatically assured. 

Reasonable opportunities to achieve identification rest
on two strategies, possibly used in combination: (i)
imposition of plausible equality restrictions among the set
of parameters and (ii) availability of multiple indicators of
the unobserved true state. The latent class Markov model
represented in Figure 1, for example, is not identified
without extra restrictions on its parameters. If the latent
chain is assumed to be time homogeneous, or response
probabilities are restricted to be equal across time, the
model can be shown to be identified (Lazarsfeld and Henry
1968). Availability of multiple indicators for the
unobserved true state can also help identification of complex
measurement models. Identification criteria for some very
special specifications have been proven (for example, the
model in Figure 2 can be shown to be identified), but no
general rules have been provided yet to ascertain global
identification. It is advisable to check at least local
identification, i.e., identifiability of the unknown
parameters in a neighbourhood of the maximum likelihood
solution. Goodman (1974) stated that a sufficient condition
for local identifiability of a latent class model is that the
information matrix be of full rank. Goodman's condition
may be computationally difficult to check. Moreover, with
some data sets, it may happen that the information matrix is
not of full rank simply because some estimates are very
close to the boundaries of the parameter space. An
alternative, empirical way to check identifiability is to
estimate the model using different sets of starting values. If
different sets of starting values result in the same value for
the log-likelihood function but in different parameter
estimates, then the model is not identifiable.
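
The empirical check described above can be sketched as a small helper that compares the results of several estimation runs. The function name, the tolerances and the example figures are hypothetical; the helper only inspects (log-likelihood, parameter vector) pairs produced elsewhere, and it assumes the parameter vectors are listed in a comparable order (in latent class models, a mere relabelling of the latent classes should be aligned before comparing).

```python
def looks_unidentified(runs, loglik_tol=1e-6, param_tol=1e-3):
    """runs: list of (loglik, params) pairs, one per set of starting values.

    Returns True when at least two runs reach the same (best) log-likelihood
    with clearly different parameter estimates.
    """
    best = max(ll for ll, _ in runs)
    # Keep only runs that reached the best log-likelihood; the others are
    # local maxima and say nothing about identifiability.
    top = [params for ll, params in runs if best - ll <= loglik_tol]
    ref = top[0]
    return any(
        max(abs(a - b) for a, b in zip(ref, other)) > param_tol
        for other in top[1:]
    )


# Hypothetical output of three runs: two agree, one stopped at a local maximum.
runs_ok = [(-1234.5, [0.9, 0.1, 0.3]),
           (-1234.5, [0.9, 0.1, 0.3]),
           (-1250.0, [0.5, 0.5, 0.5])]

# Hypothetical output where equal fit is reached with different estimates.
runs_suspect = [(-1234.5, [0.9, 0.1, 0.3]),
                (-1234.5000001, [0.7, 0.3, 0.3])]

flag_ok = looks_unidentified(runs_ok)            # False
flag_suspect = looks_unidentified(runs_suspect)  # True
```

A negative result from a handful of runs does not prove identification; it only fails to find evidence against it.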

As for estimation, modified LISREL models may be 
treated as directed loglinear models with latent variables 
(Hagenaars 1997). A directed loglinear model results in a 
sequence of parsimonious multinomial logit models, 
possibly with latent variables, which are estimated stepwise. 
At each step, one dependent variable is considered, and a 
multinomial logit model is estimated on a contingency 
table, which has been collapsed over the variables, that do 
not directly influence the dependent variable in the causal 
order. Estimates obtained at each step are, at the end, 
combined in order to obtain estimated parameters for the 
full model. Directed loglinear modelling yields exactly the 
same parameter estimates, standard errors and test statistics 
as the Goodman standard procedure, but using simpler 
marginal tables. If the causal model contains one or more 
latent variables, an appropriate estimation technique must 
be used, e.g., an implementation of the EM algorithm 
(Meng and Rubin 1993). 

The empirical validity of the complete causal model may
be tested by comparing the estimated expected frequencies
with the observed ones in the complete table, by means of
the likelihood ratio $L^2$ and the Pearson $X^2$ statistics.
However, the structure of the observed data on labour
market transitions is such that many cells show very low
observed frequencies. For this reason, the usual $X^2$ and $L^2$
criteria must be used only as a general indication of fit,
since their asymptotic $\chi^2$ distribution is no longer
guaranteed, due to the sparse and unbalanced pattern of the
contingency table.

Various strategies can be adopted to extend and improve 
model evaluation, and three of them are worth mentioning 
in this context: 


(i) A restricted model nested within a larger one can be
tested with the conditional test, i.e., considering the
difference in the $L^2$ values of the two models, which is
asymptotically distributed as $\chi^2$ under weaker
conditions (Goodman 1981, and Haberman 1978).


(ii) In general, using multiple criteria can be a sensible
strategy. Indices based on the information criterion,
such as AIC or BIC, can be useful to compare
alternative non-nested models. Another advantage of AIC and
BIC is that, in the selection procedure, they weigh the
goodness of fit of a model against its parsimony,
considering the model degrees of freedom and the
sample size. (AIC = $L^2 - 2 \times$ degrees of freedom. BIC
= $L^2 - \ln(N+1) \times$ degrees of freedom.) The model that
is preferred, in this context, is the one with the lowest
value of AIC or BIC.


(iii) Monte Carlo resampling techniques can be implemented
to simulate the asymptotic distribution of $X^2$ and $L^2$
(Langeheine, Pannekoek and van de Pol 1995).
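
The first two criteria in the list above reduce to simple arithmetic on the $L^2$ values and degrees of freedom of the fitted models. The sketch below follows the AIC and BIC formulas as printed in the text (including the $\ln(N+1)$ penalty; many references use $\ln N$, and the difference is negligible for large $N$). The fit figures and model names are hypothetical.

```python
import math


def aic(l2, df):
    """AIC = L^2 - 2 * degrees of freedom (lower is better)."""
    return l2 - 2 * df


def bic(l2, df, n):
    """BIC = L^2 - ln(N + 1) * degrees of freedom, as printed in the text."""
    return l2 - math.log(n + 1) * df


def conditional_test(l2_restricted, df_restricted, l2_general, df_general):
    """Difference in L^2 and in degrees of freedom between two nested models.

    The first value is referred to a chi-square distribution with the
    second value as its degrees of freedom.
    """
    return l2_restricted - l2_general, df_restricted - df_general


# Hypothetical fit results for two nested models (A is the restricted one).
models = {"A": {"l2": 310.4, "df": 250},
          "B": {"l2": 295.1, "df": 240}}
n = 5000  # hypothetical sample size

best_by_bic = min(models, key=lambda m: bic(models[m]["l2"], models[m]["df"], n))
delta_l2, delta_df = conditional_test(310.4, 250, 295.1, 240)
```

With these made-up numbers the more parsimonious model A has the lower BIC, while the conditional test compares a difference of 15.3 in $L^2$ against 10 degrees of freedom.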
4. TWO CASE-STUDIES


4.1 The General Set Up 


In this section we present two applications of the 
modified LISREL approach to correct observed gross flows 
in the labour market. Data come from surveys with partly 
different designs: 


(1) the U.S. Survey of Income and Program Participation
(SIPP), a multi-panel household survey, which collects
retrospective information on the between-waves
working history;

(2) the French Labour Force Survey (FLFS), a yearly
retrospective survey, with one-month overlapping
reference periods.


For each case-study, a model is specified on the basis of 
a priori information on both the true transition process and 
the error generating mechanism. A priori information is 
crucial for model specification, in order to obtain parsi- 
monious and plausible models. 

All the models are written in the form of a modified
LISREL model, and estimated by the EM algorithm.
Specifically, we used the ℓEM program (Vermunt 1993) and
checked all the models for local maxima.

The two final models turn out to be rather complex, since
they incorporate correlation among classification errors
and specific assumptions on respondents' behaviour. This
fact, together with the sparse and unbalanced pattern of the
observed contingency table, typical of labour force
transitions, calls for goodness of fit evaluation criteria
other than $L^2$ and $X^2$. In the first case-study, alternative
models have been judged by means of the BIC index, and
on the basis of substantive knowledge of the labour market
in the U.S. In the second case, alternative models have
been compared by means of the conditional test.

In the following sections, models are presented in a 
logical and verbal form, while the mathematical formulation 
for the final model is given in the relevant Appendix. 


Interview   Rot.    Wave   Reference months
Month       Group

February    2       1      Oct  Nov  Dec  Jan
March       3       1      Nov  Dec  Jan  Feb
April       4       1      Dec  Jan  Feb  Mar
May         1       1      Jan  Feb  Mar  Apr
June        2       2      Feb  Mar  Apr  May
July        3       2      Mar  Apr  May  Jun
August      4       2      Apr  May  Jun  Jul
September   1       2      May  Jun  Jul  Aug

Figure 3. Rotation Plan for the 1986 SIPP Panel (First 2 Waves)


4.2 The SIPP Data


SIPP is a multi-panel household survey conducted by the
U.S. Bureau of the Census in order to collect information
on topics such as employment, income, participation in
social programs, etc. The reference population is the U.S.
noninstitutionalized population over 14.

The survey started in 1984 and is a continuing one: as a
general pattern, each year a new sample of households,
called a “panel”, has been selected for the survey and
followed for two and a half years (for a detailed description
of SIPP, see U.S. Department of Commerce 1991, and Citro
and Kalton 1993).

Each panel is randomly divided into four “rotation
groups” and interviewed eight times at 4-month intervals.
For practical reasons, one rotation group is interviewed in
each of four consecutive months, and retrospective
questions collect information with reference to the 4-month
period elapsing between subsequent interviews. Each set of
interviews with the full sample is termed a “wave”.

We will refer to the 1986 panel, which started in
February 1986 and ended in August 1988. We will consider
the intermediate period from January 1986 to January 1987,
over which we have information from all four rotation
groups. Figure 3 represents the survey design with regard to
our sample.

Information on labour force participation is collected
mainly in the “Labour Force and Recipiency” section of the
questionnaire (for an additional piece of information,
collected in another section of the questionnaire, see
below), where each respondent is asked to report on a
weekly basis his/her labour market history in the preceding
four months (18 weeks), by going through a series of
filtered questions. The respondent is first asked whether
he/she had a job or a business at any point in time during
the reference period. If the respondent gives a negative
answer, he/she is asked whether he/she spent any time
looking for work, or was on layoff, and, if so, in exactly
which weeks. On the other hand, if the answer to the
starting question is positive (i.e., he/she worked some time),
and the respondent declared a job or a business with
continuity during the reference period, he/she will move to
the following section of the questionnaire. The respondent
not declaring a stable situation in the labour market is
asked a long series of questions in order to establish the
labour force state occupied in each single week of the
reference period.

The weekly information is usually recoded to
obtain a monthly classification based on the usual three
categories: Employed (E), Unemployed (U) and Not in the
labour force (N). For individuals occupying different
positions during one month, the monthly labour force state
is the one identified by the “modal” category with regard to
the weeks of that month (Martini 1989).
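
The recoding from weekly to monthly states by modal category can be sketched as follows. The function name is hypothetical, and the tie-breaking rule (the state reported earliest in the month wins) is an assumption made only so that the example is fully defined; the text does not say how ties are resolved.

```python
from collections import Counter


def monthly_state(weekly_states):
    """Return the modal weekly state of a month.

    Ties are broken in favour of the state that appears first in the month,
    which is an assumption of this sketch, not a rule stated in the source.
    """
    if not weekly_states:
        raise ValueError("no weekly states supplied")
    counts = Counter(weekly_states)
    top = max(counts.values())
    for state in weekly_states:
        if counts[state] == top:
            return state


# Hypothetical weekly reports for three months of one respondent.
march = ["E", "E", "U", "U", "U"]   # mostly unemployed
april = ["E", "U", "N", "E"]        # employed in two weeks out of four
may = ["U", "E", "E", "U"]          # tie between U and E

states = [monthly_state(m) for m in (march, april, may)]
```

For the three hypothetical months the result is U, E and U, the last one only because of the assumed tie-breaking rule.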

Observed gross flows between two generic calendar 
months are then obtained as follows: 


(a) For individuals belonging to three rotation groups, on 
the basis of retrospective data collected in the same 
interview. These observed flows will be called “within 
wave” (WW) transitions. 

(b) For individuals in the fourth rotation group, by 
combining information collected in two different 
interviews, four months apart. These observed flows 
are termed “between waves” (BW) transitions. 


When estimating monthly changes, a peculiar problem
with SIPP data is the so-called “seam effect” (Young
1989): more changes are observed when data for two
adjacent months are collected in two different waves (the
transition covers the seam of the waves) than when they
come from the same interview. The seam effect is pervasive
in the survey: evidence of it for several variables of interest
is reported in Martini (1988), Marquis and Moore (1989),
and Kalton and Miller (1991).

Table 1 illustrates this phenomenon for our 1986 SIPP 
panel sample. Row 4-1 contains average BW transition 
rates; rows 1-2, 2-3 and 3-4 contain average WW 
transition rates, pertaining to the position of the two 
relevant reference months in each wave (for example, row 
1-2 contains transition rates between the first two reference 
months in each wave). From Table 1, there is clear evidence 
that observed WW transitions describe a more stable labour 
market than BW ones. Moreover, WW stability increases,
moving backwards in the wave (from 3-4 to 1-2).
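
Transition rates such as those in Table 1 are row percentages of a from-state by to-state cross-tabulation, so each block of three rates starting from the same state sums to 100. A minimal sketch of that computation is given below; the function name and the records are hypothetical and are not SIPP data.

```python
STATES = ("E", "U", "N")


def transition_rates(pairs):
    """Row percentages of (state in month m, state in month m + 1) pairs."""
    counts = {a: {b: 0 for b in STATES} for a in STATES}
    for origin, destination in pairs:
        counts[origin][destination] += 1
    rates = {}
    for a in STATES:
        row_total = sum(counts[a].values())
        rates[a] = {
            b: (100.0 * counts[a][b] / row_total) if row_total else float("nan")
            for b in STATES
        }
    return rates


# Hypothetical records for 130 respondents observed in two adjacent months.
pairs = (
    [("E", "E")] * 97 + [("E", "U")] * 2 + [("E", "N")] * 1
    + [("U", "E")] * 3 + [("U", "U")] * 6 + [("U", "N")] * 1
    + [("N", "N")] * 19 + [("N", "E")] * 1
)

rates = transition_rates(pairs)
```

Computing the same table separately for within-wave and between-wave pairs of months is what makes the seam effect visible: the diagonal rates (EE, UU, NN) are higher in the within-wave table.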

One reasonable explanation for the seam effect, and for 
the systematic pattern of observed transitions throughout a 
wave, is the different role of measurement errors, for data 
obtained under the BW and WW strategies respectively. 
Specifically, it is likely that classification errors have a 
different degree of correlation for WW and BW observed 
flows: the higher stability documented by WW transitions 
may be induced by highly correlated classification errors. 
Indeed, if errors were uncorrelated, specifically for WW
transitions, no evidence of seam effect would be expected.

A variety of plausible causes of correlated errors is
suggested by the cognitive psychology and the survey
methodology literature on memory effects and recall errors
(see Bernard, Killworth, Kronenfeld and Sailer 1984, and
O’Muircheartaigh 1996), among which is a “conditioning”
effect: respondents tend to give the same answer going
backwards within the wave and, in extreme cases, they
mechanically repeat the same answer for all four months.


Table 1 
Observed Monthly Transition Rates (x100) for the 1986 
SIPP Panel, January 1986 to January 1987 


Type EE EU EN UE UU UN NE NU NN
1-2 WW 98.27 1.04 0.69 15.46 79.63 4.91 1.15 1.42 97.43 
OFS 113 80:96 13471596 ns Os On 38) 1716.96.91 
OTASSE E20 NOS 9235 7132580 a2 le28 ee 69297-03 


4-1 BW 94.03 2.10 3.87 26.81 42.20 30.99 5.65 3.77 90.58 

Abundant empirical literature shows that this sort of
conditioning effect is the main source of classification
errors in SIPP data. Other potential sources of error, typical
of panel surveys, do not affect SIPP data dramatically.
Administrative record check studies find little, if any,
evidence of time-in-sample effect (Chakrabarty and
Williams 1989; McCormick, Butler and Singh 1992). As a
general consideration, we may say that in SIPP data the
seam effect dominates other sources of error that
potentially bias gross flows estimates.

Summing up, a model-based approach to obtain unbiased
gross flows from SIPP data is justified by two arguments:


(a) the patent presence of correlated classification errors;

(b) a priori information on the data generating
mechanism, drawn from two sources:

(b1) specific evidence emerging from SIPP observed
gross flows, such as the seam effect
and the increase in stability going backwards
within the wave, just documented;

(b2) general hints provided by the social survey
literature on respondent behaviour.


In order to correct SIPP observed labour force gross
flows for classification errors, a model has been built,
based on the following assumptions/information:


(a) the true transition process follows a first order Markov 
chain; 

(b) WW data transitions are affected by correlated classi- 
fication errors, according to a pattern that will be 
specified in the sequel; 

(c) for BW, the standard ICE assumption holds; 

(d) rotation groups are equivalent samples also for model- 
ling purposes, i.e., respondents behave in the same way 
in all four rotation groups; 

(e) SIPP data provide two indications on the monthly labour
force state of each individual: the detailed information
collected in the “Labour Force and Recipiency” section
of the questionnaire, just presented, and the
additional information collected in the “Earnings
and Employment” section, where the respondent is
asked, on a weekly basis, whether he/she had a job in
the reference period.

Figure 4. Path Diagram of a Modified LISREL Model for Four
Measurement Points and Two Indicators for Each Latent Variable
(for the Meaning of Symbols, See Main Text)


Figure 4 contains the path diagram of a simplified
version of the model (i.e., a version that does not aim at
representing in detail the pattern of correlated classification
errors, nor at taking into account the fact that we are dealing
with four rotation groups) for four points in time, i.e., for
four consecutive calendar months. Here $y_t$ ($t = 1, 2, 3, 4$)
represents the latent variables; $Y_t$ and $W_t$ represent the
indicators; arrows indicate direct effects between pairs of
variables. Indicator $Y_t$ refers to the reported labour force
state, described by the usual three categories (E, U and N),
while $W_t$ refers to the binary variable Job/No Job. Since
the information is collected in two different sections of the
questionnaire, and with different interviewing procedures,
$Y_t$ and $W_t$ can be assumed to be independent given $y_t$. On
the other hand, direct effects between the indicators
account for correlated classification errors over time: the
response given for time $t+1$ affects that given for time $t$.
Note also that an additional variable $G$ with four categories
should be added to the diagram, to account for rotation
group membership. All indicators depend on $G$, since units
in different groups are interviewed in different calendar
months.

The basic equation of the model decomposes the
proportion in the generic cell of the 9-way contingency
table into the product of the conditional probabilities
reported in Appendix A, equations (A1) to (A7). A
preliminary version of the model has been proposed in Bassi,
Croon, Hagenaars and Vermunt (1995).

Equation (A1) defines the probability of belonging to
one of the four rotation groups. Equations (A2) and (A3)
define the initial condition and the transition probabilities
of the latent first order Markov chain, respectively.
Equations (A4) and (A5) define the response probabilities
for indicator $Y_t$; equations (A6) and (A7) the analogous
probabilities for the dichotomous indicator $W_t$. The
response probabilities are defined in such a way that the
answer given for a certain month depends jointly on the
current true state ($y_t$) and on the “past” true and “past”
reported states ($y_{t+1}$ and $Y_{t+1}$). The term “past” refers to
the way respondents think while answering retrospective
questions: they start recalling from the moment of time
nearest to the interview, and go backwards up to the end of
the reference period.

A complex set of constraints has been imposed on 
response probabilities of (A4), (A5), (A6) and (A7), to 
account for (i) the conditioning effect, and (ii) the fact that 
the four rotation groups are equivalent samples in terms of 
the error generating mechanism. 

These constraints are formulated in detail in Appendix
A. Basically, they incorporate a priori knowledge on
respondents' behaviour, and allow us to specify a
parsimonious model. Specifically, equations (A8) to (A14)
correspond to the following statements:


(a) With regard to WW classification errors, following
Hubble and Judkins (1989), it is assumed that:

(a1) a respondent who wrongly reports his/her labour
force state for a certain month continues to
repeat this same answer for the adjacent
month, going backwards within the wave (A8);

(a2) if, however, the status at time $t+1$ is correctly
reported, the response probability for the adjacent
month depends only on the current true
state (A9);

(a3) the same error generating mechanism operates
for both indicators. For $W_t$, we state that a
correct answer is given when the true state is E
and ‘Job’ is reported and when the true state is U
or N and ‘No Job’ is reported, (A10) and (A11).

(b) Response probabilities are set equal across rotation
groups, (A12) to (A15). As an example, equalities in
(A12) mean that response probabilities for individuals
in rotation group 1 for the month of April are
equal to response probabilities for individuals in
group 4 for the month of March, to those for
individuals in group 3 for the month of February, and
to those for individuals in group 2 for the month of
January. (They are set to be equal, since they all refer
to the answer given for the last month of the wave.)
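
Statements (a1) and (a2) can be read as a rule for the within-wave response probability of indicator $Y_t$. The sketch below encodes that reading; it is an interpretation of the verbal statements, not a transcription of equations (A8) and (A9), which are not reproduced here. The function name and the values in `q` (the response probabilities that apply when the adjacent month was reported correctly) are hypothetical.

```python
STATES = ("E", "U", "N")


def ww_response_prob(reported, true_now, true_next, reported_next, q):
    """P(Y_t = reported | y_t = true_now, y_{t+1} = true_next, Y_{t+1} = reported_next).

    (a1) If month t+1 was reported wrongly, the same answer is repeated
         for month t with probability one.
    (a2) If month t+1 was reported correctly, the answer for month t
         depends only on the current true state, through q.
    """
    if reported_next != true_next:
        return 1.0 if reported == reported_next else 0.0
    return q[true_now][reported]


# Hypothetical response probabilities q[true state][reported state].
q = {
    "E": {"E": 0.98, "U": 0.01, "N": 0.01},
    "U": {"E": 0.05, "U": 0.85, "N": 0.10},
    "N": {"E": 0.02, "U": 0.03, "N": 0.95},
}

# Respondent truly U in both months who wrongly reported N for month t+1:
p_repeat = ww_response_prob("N", "U", "U", "N", q)   # the error is repeated
# Same respondent, had month t+1 been reported correctly as U:
p_free = ww_response_prob("N", "U", "U", "U", q)     # falls back on q
```

Under this reading a single wrong answer propagates backwards through the whole wave, which is one way to generate the high within-wave stability seen in the observed flows.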


The model has been estimated to correct observed
monthly gross flows for the quarter January to April 1986
(Table 2). The comparison between observed and estimated
flows highlights that the model reduces the seam effect:
WW transitions are corrected towards a more dynamic
labour market; BW transitions are corrected in the opposite
direction. It is worth noting that the effects of model
correction are more evident for flows from unemployment,
which are characterised by higher mobility.

The goodness of fit of the model has been judged by
multiple criteria such as the BIC index and the conditional
test for nested models, together with the interpretability of
the estimates and their consistency with substantive
knowledge of the dynamics of the U.S. labour market in the
1980s.


4.3. The French Labour Force Survey Data 


The second case-study refers to the flows in the labour 
market, observed with the French Labour Force Survey 
(FLFS) conducted yearly by INSEE in France. 


Survey Methodology, December 1998
Table 2
SIPP Observed and Estimated Monthly Transition Rates
(x100), January to April 1986

                    EE     EU     EN     UE     UU     UN     NE     NU     NN

J-F  WW          98.11    [?]   0.72  14.53  80.16    [?]   0.90    [?]    [?]
     BW          94.08    [?]    [?]  23.58  44.30  32.12   5.62   3.45  90.93
     Estimated   97.25   1.47   1.28  16.08  77.16   6.76   1.59   1.32  97.09
F-M  WW          98.66   0.92   0.42  16.06  78.67   5.27   0.64   1.65    [?]
     BW          94.88   1.91   3.21  21.90  48.54  29.56   4.99   4.11  90.90
     Estimated   97.83   1.20   0.97  19.40  74.01   6.59   1.21   1.50  97.29
M-A  WW          98.71   0.64   0.65  20.76  71.74   7.50   1.47   1.05  97.48
     BW          95.59   1.52   2.89  30.48  34.92  34.60   6.34   3.78  89.88
     Estimated   98.11   0.95   0.94  26.42  65.75   7.83   2.17   0.71  97.12

[?]: entry illegible in the source.


The reference population of the FLFS consists of all members
of French households who are over 15 in the year in which
the interview is planned. The survey has a rotating design:
each year, one third of the sample is renewed.

Information on labour force participation is collected 
with retrospective questions, having as a reference period 
the 13 months preceding the interview. Each respondent is 
asked to recall his/her position in the labour market on a 
monthly basis, by filling in a grid in which he/she can 
classify himself/herself, for each month, into one of eight
categories: self-employed, employed on a fixed term basis,
permanently employed, unemployed, on training, student,
serving in the Army, other (retired, housewife, etc.).

For our analysis, we aggregated the eight categories in 
the usual three states E, U and N. We consider ‘Employed’ 
respondents who classify themselves in the first three 
categories, ‘Unemployed’ those who classify themselves in 
the fourth category and ‘Not in the labour force’ the 
remaining ones. 
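
The aggregation just described amounts to a fixed lookup from the eight self-reported categories to the three states. A minimal sketch follows, using the category labels as they appear in the text; the dictionary and function names are hypothetical.

```python
# Mapping of the eight FLFS grid categories to E, U and N, as described above:
# the first three are Employed, the fourth Unemployed, the rest Not in the
# labour force.
FLFS_TO_STATE = {
    "self-employed": "E",
    "employed on a fixed term basis": "E",
    "permanently employed": "E",
    "unemployed": "U",
    "on training": "N",
    "student": "N",
    "serving in the Army": "N",
    "other": "N",
}


def recode_history(monthly_categories):
    """Turn a respondent's monthly grid entries into a sequence of E/U/N states."""
    return [FLFS_TO_STATE[c] for c in monthly_categories]


# Hypothetical four-month extract from one respondent's grid.
history = recode_history(
    ["student", "student", "unemployed", "employed on a fixed term basis"]
)
```

For the hypothetical extract the recoded sequence is N, N, U, E, i.e., a transition out of education into the labour force.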

We analyze the information collected in the two
consecutive waves of March 1991 and March 1992, on a
subsample of individuals: those who responded to three
consecutive interviews (January 1990, March 1991 and
March 1992) and who were 18 to 29 years old in 1992, for 
a total of 5,427 individuals. The reference periods of the 
two waves considered, overlap in March 1991. We have, 
then, two pieces of information on the labour force state for 
this month: one collected in March 1991, and the other one 
collected with a retrospective question 12 months afterwards. 

The pattern of observed monthly transitions in our FLFS 
sample shows some interesting evidence, largely dictated by 
the characteristics of the subsample — young people. 

Transitions exhibit a moderate degree of seasonal 
variation, related to the school calendar. From June to July, 
for example, we observe a proportion of people who enter 
the labour market as employed, greater than the average; on 
the contrary, from August to September, a proportion 
greater than the average leaves employment (presumably to 
education). 


The marginal distribution of the three states from March 
1990 to March 1992, shows that the individuals in our 
sample progressively enter the labour market: in March 
1990, 44% are observed to be Employed or Unemployed, 
whereas by March 1992, this proportion has risen to 54%. 

The double information for March 1991, provides some 
crude evidence on response error in the data: 8% of respon- 
dents declare a different state in the two interviews. For the 
period from February to April 1991, two types of flows may 
be observed: a within wave (WW) one, i.e., information 
about the labour force state is collected in the same 
interview, and a between waves (BW) one, i.e., information 
is collected in two different interviews (Table 3). 


Table 3 
FLFS Observed Monthly Transition Rates (x100) from 
February to April 1991 

              EE       EU       EN       UE       UU       UN       NE       NU       NN
F-M   WW    98.19     1.67     0.14     9.11    90.65     0.24     0.28     0.11    99.61
      BW   [illeg.] [illeg.] [illeg.] [illeg.] [illeg.] [illeg.]   3.75     1.96    94.29
M-A   WW    98.60     1.04     0.36     8.89    90.37     0.74     0.24     0.29    99.47
      BW    93.24     3.33     3.43    25.90    63.79    10.31     3.79     2.07    94.14


As expected, WW transitions describe a more stable 
labour market than the BW ones. This can be considered as 
an indication of correlated classification errors in the data. 
Patterns and causes of error correlation in retrospective 
surveys have been extensively discussed in the two 
previous sections, and those considerations largely 
extend to the FLFS data. 

In general, we expect that, in a retrospective survey with 
such a long recall period, memory lapses are the 
major cause of classification errors. We also expect that the 
probability of answering incorrectly increases with the 
distance between the reference month and the interview 
month. This may be considered the major 
source of correlation among classification errors, together 
with telescoping and conditioning effects, which possibly 
affect FLFS data as well (see Magnac and Visser 1995). 

The overall effect of correlated classification errors is, 
plausibly, an underestimation of mobility in the 
French labour market. 

Based on these considerations, we specified a model 
to correct observed quarterly gross flows for measurement 
error (Table 4). The last column of Table 4 contains 
the percentage of individuals who are observed to change 
state between the two months considered (OM = observed 
mobility). On average over the five WW transitions, 
the observed mobility between two consecutive reference 
months is 6.122%. 
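
As a minimal sketch of how the quantities in Table 4 are defined, the following computes row-percentage transition rates and the observed mobility from paired states. The function names and the toy data are our own illustrative assumptions; they are not the FLFS data.

```python
from collections import Counter

STATES = ("E", "U", "N")


def transition_rates(pairs):
    """Row percentages (x100) of each origin-destination flow."""
    counts = Counter(pairs)
    rates = {}
    for origin in STATES:
        row_total = sum(counts[(origin, dest)] for dest in STATES)
        for dest in STATES:
            if row_total:
                rates[origin + dest] = 100.0 * counts[(origin, dest)] / row_total
    return rates


def observed_mobility(pairs):
    """Percentage of individuals whose state differs between the two months."""
    pairs = list(pairs)
    movers = sum(1 for origin, dest in pairs if origin != dest)
    return 100.0 * movers / len(pairs)


# Toy data (hypothetical): 40 individuals observed at two reference months.
toy = ([("E", "E")] * 18 + [("E", "U")] * 2 + [("U", "E")] * 1
       + [("U", "U")] * 3 + [("N", "N")] * 16)
toy_rates = transition_rates(toy)
toy_om = observed_mobility(toy)
```

Note that the transition rates are conditional on the origin state, whereas the mobility figure is a share of the whole sample, so it cannot be recovered from the rates alone without the marginal distribution of origin states.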

As in the previous case studies, let us denote by 
$y_t$ ($t = 1, 2, 3, 4, 5, 6$) the true labour force states, and by 
upper case letters their indicators: $Y_t$ ($t = 2, 3, 4, 5, 6$) 
represents labour force states observed in March 1992 
(referring to March, June, September, December 1991 and 
March 1992); $W_t$ ($t = 1, 2$) represents labour force states 
observed in March 1991 (referring to December 1990 and 
March 1991). As usual, $y_t$, $Y_t$ and $W_t$ are distributed over 
the three categories E, U and N. 

The model is specified by decomposing the proportion 
in the generic cell of the 7-way contingency table as in 
Appendix B, equations (B1) to (B6). 

Since we observe two indicators for one month only, a 
model which assumes direct effects between the indicators 
would be underidentified. Thus, we cannot explicitly 
model dependencies between observed states. The only way 
to account for correlated classification errors in FLFS data 
is to let observed states depend on latent transitions. 
This seems to be a sensible assumption in retrospective 
surveys. Indeed, flows between two different states may 
easily be misplaced in time, because in some 
situations events are genuinely difficult to date exactly. 
As an example, employees who lose their job or retire 
(flows EU and EN) will generally use the holidays they are 
entitled to, and may not know exactly when they left 
employment. The moment people entered the labour force 
may also be hard to recall, especially when they left school 
(flows NU and NE) (van de Pol and Langeheine 1997). 

The modified LISREL model, formulated in mathematical 
terms in Appendix B, is based on the following 
substantive assumptions. 

At the latent level, transitions follow a first-order 
nonstationary Markov chain (equations (B1) and (B2)). Indeed, 
the evidence of seasonality in observed transitions suggests 
avoiding the imposition of stationarity of any order on the 
latent Markov chain. 

Response probabilities for data collected in both waves 
depend on the latent transition occurring between $t$ and $t + 1$ 
(equations (B3) and (B4) refer to data collected in March 
1992, equations (B5) and (B6) to data collected in March 
1991). 

In order to describe the error generating mechanism in 
detail, and specify a more parsimonious model, the 
following constraints have been imposed on response 
probabilities: 


(a) response probabilities referring to the same month of 
subsequent years (December and March) are set equal; 

(b) response probabilities at time $t$, given that the true 
state has not changed between time $t$ and time $t + 1$, 
are set constant over time; 

(c) response probabilities are set equal for June and 
September 1991; 

(d) in general, respondents who move between month $t$ 
and $t + 1$ (transitions EU, EN, UE, UN and NU) report, at 
time $t$, either the true state occupied at time $t$ or 
the true state occupied at time $t + 1$, i.e., they do not 
report a state they have not moved from or to; 

(e) if, however, the latent transition occurs between states 
N and E, we admit all three answers at time $t$, i.e., we 
consider that people who find a job may confuse their 
previous position (at time $t$), and be uncertain between 
U and N. 
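
Constraints (d) and (e) amount to a pattern of structural zeros on the response probabilities. The sketch below enumerates that pattern under two assumptions of ours: that the exception in (e) applies to the latent transition NE only, and that stayers may report any of the three states. The function names are illustrative.

```python
STATES = ("E", "U", "N")


def allowed_reports(origin, destination):
    """States that may be reported at time t, given the latent transition."""
    if origin == destination:
        # Stayers: assumed here to have no structural zeros.
        return set(STATES)
    if (origin, destination) == ("N", "E"):
        # Constraint (e): all three answers are admitted.
        return set(STATES)
    # Constraint (d): only the state left or the state entered is reported.
    return {origin, destination}


def structural_zeros():
    """List (latent transition, reported state) pairs fixed to probability 0."""
    zeros = []
    for origin in STATES:
        for destination in STATES:
            for report in STATES:
                if report not in allowed_reports(origin, destination):
                    zeros.append((origin + destination, report))
    return zeros


zeros = structural_zeros()
```

Under these assumptions each of the five transitions listed in (d) contributes exactly one zero, which is the kind of pattern a restriction table for the response probabilities would encode.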


Constraint (c) is imposed mainly for reasons of model 
parsimony. It captures the notion that response probabilities 
for months that are placed more or less in the central part of 
the reference period, do not vary too much. 

Constraints (b) and (d) reflect the fact that response 
probabilities depend on latent transitions. We expect that 
these probabilities do not vary too much over time when 
there is no latent change (constraint (b)), whereas we expect 
that the probability of misplacing change, especially in 
ambiguous situations, increases with the length of the recall 
period. Constraints under (d) aim at capturing the telescoping 
effect. 


Table 4 
FLFS Observed Quarterly Transition Rates (x100), December 1990 to March 1992 
(OM = Observed Mobility) 

                EE     EU     EN     UE     UU     UN     NE     NU     NN     OM
D90-M91  WW   94.77   4.25   0.98  24.53  72.40   3.07   0.98   0.66  98.36   5.08
         BW   91.50   4.86   3.64  31.60  56.84  11.56   4.40   2.10  93.50  10.16
M91-J91  WW   96.03   3.02   0.95  23.21  74.32   2.47   1.28   0.68  98.04   4.54
         BW   91.48   4.63   3.89  35.01  54.20  10.79   4.84   2.14  93.02  12.04
J91-S91  WW   94.29   3.94   1.77  20.93  78.29   0.78   4.71   2.95  92.34   7.85
S91-D91  WW   93.73   4.48   1.79  23.63  74.89   1.48   3.22   1.65  95.13  [illeg.]
D91-M92  WW   93.90   4.80   1.30  21.67  76.74   1.59   1.70   0.59  97.71  [illeg.]

Figure 5 gives the path diagram of the estimated model. 


Figure 5. Path Diagram of a Modified LISREL Model for Six 
Measurement Points and Two Indicators for One Latent 
Variable 


Table B.1 in Appendix B reports the pattern of 
restrictions on response probabilities, (a) to (e); it shows 
which parameters are set equal, and which are fixed to 0, in 
order to introduce into the basic model, as defined by 
equations (B1) to (B6), the above constraints. 

The final model has been selected after comparing a 
sequence of models, as can be seen from Table 5. 


Table 5 
Model Selection (EM = Estimated Mobility) 

Model       L²        df     L² cond. test   p-value    EM
A        2509.5759   2124                              5.424
A1       3450.1716   2154      940.5957       0        4.918
A2       3849.9470   2178      399.7754       0        5.798
B         816.1620   2076                              5.888
B1        855.2282   2094       39.0662       0.01     5.818
B2        864.9657   2106        9.7375       0.40     5.906
B3        879.5996   2121       14.6339       0.10     6.252
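
The conditional test column can be checked directly from the fit statistics: for two nested models it is the difference in $L^2$, referred to a chi-square distribution on the difference in degrees of freedom. The short sketch below reproduces these differences from the values printed in Table 5; the function name is ours, and the p-values are not recomputed.

```python
# L-squared statistics and degrees of freedom as printed in Table 5.
FIT = {
    "A": (2509.5759, 2124),
    "A1": (3450.1716, 2154),
    "A2": (3849.9470, 2178),
    "B": (816.1620, 2076),
    "B1": (855.2282, 2094),
    "B2": (864.9657, 2106),
    "B3": (879.5996, 2121),
}


def conditional_test(restricted, general):
    """Difference in L-squared and in df between two nested models."""
    l2_restricted, df_restricted = FIT[restricted]
    l2_general, df_general = FIT[general]
    return round(l2_restricted - l2_general, 4), df_restricted - df_general


# Each model is compared with the one it is nested in.
comparisons = {
    pair: conditional_test(*pair)
    for pair in [("A1", "A"), ("A2", "A1"), ("B1", "B"), ("B2", "B1"), ("B3", "B2")]
}
```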


We started the analysis by estimating a model based on 
the ICE assumption (model A in the table), which, as 
expected, shows a bad fit. 

The following models (A1 and A2) are based on the 
work by Magnac and Visser (1995). These authors consider 
monthly transitions over a longer period than ours (from 
January 1989 to March 1992), but on the same sample of 
individuals. They assume that the labour force state in the 
interview month is correctly reported, while the probability 
of making mistakes increases with the distance between the 
reference month and the time of interview, according to a 
deterministic function of time. Response probabilities are 
assumed to be constant over the survey waves, and true 
transitions are assumed to follow a first-order stationary 
Markov chain. Our model A1 is a less restricted version of 
Magnac and Visser's model (no stationarity assumption is 
made), applied to quarterly transitions from December 1990 
to March 1992. Our model A2 adds to model A1 the 
hypothesis of first-order stationarity at the latent level. Both 
models perform quite badly, and (from column EM) we see 
that, on average, they correct the observed labour market 
towards stability: a result which contradicts the evidence on 
the effects of classification errors in retrospective surveys. 

Model B introduces correlation among classification 
errors by letting each indicator depend on the true 
transition that occurred between times $t$ and $t + 1$; 
moreover, it incorporates constraint (a). The fit improves 
dramatically (see $L^2$). All subsequent models are nested in 
model B, and additional restrictions may be evaluated by a 
conditional test. Model B1 introduces constraints under (b); 
model B2 the additional constraints under (c); and model 
B3 is our final model. 

Table 6 presents estimated transition rates with our best 
fitting model. The French labour market is corrected 
towards greater mobility. The average estimated mobility 
amounts to 6.252%. Moreover, estimated response 
probabilities show a pattern consistent with the notion that 
the probability of making mistakes increases with the length 
of the recall period. 

Table 6 
FLFS Estimated Quarterly Transition Rates (x100), December 1990 to March 1992 
(EM = Estimated Mobility) 

            EE     EU     EN     UE     UU     UN     NE     NU     NN     EM
D90-M91   94.85   4.48   0.67  12.70  66.28  21.02   1.09   1.55  97.36   6.27
M91-J91   95.65   1.37   2.98  28.43  62.35   9.22   3.61   1.48  94.91   7.49
J91-S91   93.71   4.25   2.04  14.88  82.50   2.62   4.11   3.49  92.40   7.70
S91-D91   98.32   1.67   0.01  15.42  83.75   0.83   3.80   0.47  95.73   4.24
D91-M92   93.25   5.02   1.73   9.99  88.65   1.36   2.07   1.28  96.65   5.56


APPENDIX A 


Final Model Specification for the SIPP Data, in Terms of Conditional Probabilities 


(1) Basic model decomposition 
z,= P(G =g) 


my} = P(y, =4,) 


Ji di-1 


Ti, = PCy, =3,1¥,-1 =A) baa 


Ld, Lain& 
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Fi LU, i l, ly, we ! ee - LNs “ae G = Si) 
[Be 


IJ 8 ' 
Iya = PCY, =1,|¥4 =i4, G = 8) 


mJ, M,.\J1.18 
wi 


=P, =m,\y, Shee Wt = Mi Vp) i PAS 2) 
f=1;2,3 


Ms J48 


Vw = P(W,=m,|¥4 =i» G =) 


(Al) 


(A2) 


(A3) 


(A4) 


(A5) 


(A6) 


(A7) 


g varies over 1, 2, 3 and 4; is ANd. fia Hl) Laos oaly, 
over the categories E, U andN, m,,t = 1, 2, 3,4, vary over 


the categories ‘Job’ and ‘No Job’. 


(2) Constraints on conditional probabilities 


(A11) 

(A12) 

(A13) 

(A14) 

(A15) 


APPENDIX B 


Final Model Specification for the FLFS Data, in Terms of Basic Model Decomposition and Pattern of Restrictions on Parameters 


(1) Basic model decomposition 
J ; 
tp P( yi ay ) 
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(2) Pattern of restrictions on response probabilities 


Table B.1 
Month of Observation 
Probability of 
observing a state June 91 & 


Sept. 91 


Dec. 90 & 


March 91 
eas Dec. 91 


given a latent 
transition 
Elee 

Uke 

Nkee 

Eleu 

Ulu 


FY Fr Ost 


ex 

s 

= 
oo Sr or oe er ae ON A Oe rt ee ae reo 
Co Fry er se rl er ae ry er Se con Ch SS Se Ee eee Se tt 


OO; Sy Fr) hr oe Er ON a er er rer) 


Zé 
on 
3} 
Ne} 
No) 
Ne) 


Equal numbers indicate response probabilities fixed to be equal. 
* indicates a probability fixed to 0. 
F indicates a free parameter. 


Estimating Labour Force Gross Flows From Surveys Subject 
to Household-level Nonignorable Nonresponse 


PAUL S. CLARKE and RAY L. CHAMBERS¹ 


ABSTRACT 


Measurement of gross flows in labour force status is an important objective of the continuing labour force surveys carried 
out by many national statistics agencies. However, it is well known that estimation of these flows can be complicated by 
nonresponse, measurement errors, sample rotation and complex design effects. Motivated by nonresponse patterns in 
household-based surveys, this paper focuses on estimation of labour force gross flows, while simultaneously adjusting for 
nonignorable nonresponse. Previous model-based approaches to gross flows estimation have assumed nonresponse to be 
an individual-level process. We propose a class of models that allow for nonignorable household-level nonresponse. A 
simulation study is used to show, that individual-level labour force gross flows estimates from household-based survey data, 
may be biased and that estimates using household-level models can offer a reduction in this bias. 


KEY WORDS: Gross flows; Household-based surveys; Nonignorable nonresponse. 


1. INTRODUCTION 


Labour force gross flows are typically defined as 
transitions over time between the three major labour force 
states, employed, unemployed and economically inactive. 
Gross flows estimates are an important tool in the study of 
labour force dynamics (for example, see Vanski 1985). 
Large-scale on-going surveys such as the British Labour 
Force Survey and the U.S. Current Population Survey, 
provide data for gross flows estimation. However, non- 
response, measurement error, sample rotation and complex 
design effects, affect gross flows estimation from these 
surveys. A discussion of these and other factors affecting 
gross flows estimation, is given in Hogue (1985). Here we 
focus on the problem of nonresponse. 

We assume that a nonresponse mechanism leads to the 
observed data being incomplete. If the probability of not 
responding depends on the missing data, then the non- 
response mechanism is nonignorable (Rubin 1976). The 
model-based approach to analysing incomplete survey data, 
is detailed in Little (1982). Model-based approaches to the 
estimation of labour force gross flows, involve modelling 
both the labour force flows and the nonresponse 
mechanism, and simultaneously fitting both models to the 
incomplete data. Examples of such models are given in 
Stasny and Fienberg (1985), Stasny (1986) and, for 
nonignorable nonresponse, in Little (1985). We call these 
individual-level models, because individuals are modelled 
as responding or not responding, independently of other 
sampled individuals. 

Both the Labour Force Survey and the Current 
Population Survey are examples of household-based 
surveys, that is, surveys based on a random sample of 
households rather than individuals. Household-based 
surveys can lead to correlated nonresponse behaviour 
within households. For example, in the Current Population 
Survey, a single household member (usually the head of 
household) acts as a proxy for the other household members; 
thus, if the chosen household member is a nonrespondent, 
so are the other household members. It follows 
that, due to correlated within-household nonresponse 
behaviour, individual-level nonresponse models are 
unsuitable for the estimation of labour force gross flows 
using household-based survey data. 

In this paper, we propose a class of models for 
individual-level labour force flows, and household-level 
nonresponse, that account for correlated within-household 
nonresponse behaviour. A number of plausible nonresponse 
models that are estimable from the observed data, both 
ignorable and nonignorable, are also presented. We then 
simulate household-based survey data, using these house- 
hold-level models, to demonstrate the potential utility of our 
approach: first, individual-level labour force gross flows 
estimates are shown to be biased, when fitted to household- 
based survey data; and second, the bias of individual-level 
and household-level gross flows estimates are compared, to 
show the advantages of fitting household-level models to 
household-based survey data. To conclude, we summarise 
the findings of our simulation studies and discuss ideas for 
further research in this area. 


2. A MODEL FOR HOUSEHOLD-LEVEL 
NONRESPONSE 


2.1 The Data 


A gross flow is the probability or frequency of 
individuals in the population making a state transition 
between two points in time, $t_1$ and $t_2$ ($t_1 < t_2$). Labour force 
gross flows refer to transitions between the three main 
labour force states: 1 = ‘employed’, 2 = ‘unemployed’ and 
3 = ‘not in labour force’, where the last category refers to 
economically inactive individuals, such as retired individuals 
and students. Let $S$ denote a simple random sample 
of households, indexed by $h$. Within household $h$, there are $n_h$ 
eligible individuals, of which $n_h(ab)$ have labour force 
flow $(a, b)$ between $t_1$ and $t_2$, where $\sum_{a,b} n_h(ab) = n_h$ and 
$a, b = 1, 2, 3$. We refer to $\{n_h(ab)\}$ as the complete data, 
that is, the frequencies that would be observed in the 
absence of nonresponse. 

Table 1 shows the complete labour force flows data for 
household $h$ as a $3 \times 3$ contingency table. If $h$ responds at 
both times, the observed data are the cells of this 2-way 
table. However, if the household does not respond at $t_1$ or $t_2$, 
the observed data correspond to the margins of the table: 
$n_h(1+)$, $n_h(2+)$, $n_h(3+)$ are the observed data if $h$ responds 
at $t_1$, but does not respond at $t_2$; and $n_h(+1)$, 
$n_h(+2)$, $n_h(+3)$ are the observed data if $h$ responds at $t_2$ but 
does not respond at $t_1$. (An index replaced by ‘+’ denotes 
summation over all levels of that index.) Furthermore, if $h$ 
responds at neither $t_1$ nor $t_2$, the observed data is the 
household size, $n_h$, which we take to be known and fixed 
between $t_1$ and $t_2$. 


¹ Paul S. Clarke and Ray L. Chambers, Department of Social Statistics, University of Southampton, Highfield, Southampton, SO17 1BJ, United Kingdom. 
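
The way nonresponse reduces the complete table to its margins can be sketched as follows. The function name and the list-of-lists representation of the $3 \times 3$ table are our own illustrative choices; rows index status at $t_1$, columns index status at $t_2$, and a response indicator of 1 means the household responds.

```python
def collapse(table, response_flow):
    """Observed data for a household, given its complete table and response flow.

    table: 3x3 list of lists of counts n_h(ab).
    response_flow: (u, v), where 1 means the household responds at that time.
    """
    u, v = response_flow
    if (u, v) == (1, 1):
        return [row[:] for row in table]            # full cross-classification
    if (u, v) == (1, 0):
        return [sum(row) for row in table]          # row margins n_h(a+)
    if (u, v) == (0, 1):
        return [sum(col) for col in zip(*table)]    # column margins n_h(+b)
    return sum(sum(row) for row in table)           # household size n_h only


# Toy household of five (hypothetical counts).
complete = [[2, 1, 0],
            [0, 1, 0],
            [0, 0, 1]]
rows_only = collapse(complete, (1, 0))
cols_only = collapse(complete, (0, 1))
size_only = collapse(complete, (0, 0))
```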


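Table 1 
Complete Labour Force Flows Data for Household $h$ 

                          Status at $t_2$
Status at $t_1$       1            2            3          Total
      1           $n_h(11)$    $n_h(12)$    $n_h(13)$    $n_h(1+)$
      2           $n_h(21)$    $n_h(22)$    $n_h(23)$    $n_h(2+)$
      3           $n_h(31)$    $n_h(32)$    $n_h(33)$    $n_h(3+)$
    Total         $n_h(+1)$    $n_h(+2)$    $n_h(+3)$    $n_h$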


2.2 Model Specification 


It is inappropriate to treat the nonresponse behaviour of 
individuals within a household as independent in 
household-based surveys. In the Labour Force Survey, for 
example, one eligible household member determines 
whether the household can be interviewed. Therefore, if no 
eligible individual can be contacted, every individual in the 
household is a nonrespondent. To construct a model for 
household-level nonresponse, we take the ideas behind 
individual-level nonresponse and extend them to the 
household, by considering a household to be an entity with 
its own nonresponse flow between $t_1$ and $t_2$. To allow for 
nonignorable nonresponse, the probability of a household 
nonresponse flow is modelled as a function of its individual 
labour force flows, as shall now be described. 

Let $\mathbf{N}_h = (N_h(11), N_h(21), \ldots, N_h(33))$ be the random 
vector of labour force flows frequencies for household $h$, 
where $N_h(ab)$ is the random variable whose outcome 
corresponds to the number of individuals with labour force 
flow $(a, b)$, $a, b = 1, 2, 3$. Further, denote the random vector 
for the nonresponse flow of household $h$ by 
$\mathbf{R}_h = (R_{h1}, R_{h2})$, where 

$$R_{hj} = \begin{cases} 1, & \text{if household responds at } t_j \\ 0, & \text{otherwise} \end{cases}$$

is the nonresponse status random variable for $h$ at 
$t_j$, $j = 1, 2$. The realisations of these random quantities are 
denoted by $\mathbf{n}_h$ and $\mathbf{r}_h$. We now assume that $\mathbf{n}_h$ and $\mathbf{r}_h$ are 
known, and write the joint probability of $\mathbf{N}_h$ and $\mathbf{R}_h$ as 

$$\Pr(\mathbf{N}_h = \mathbf{n}_h, \mathbf{R}_h = \mathbf{r}_h) = \Pr(\mathbf{N}_h = \mathbf{n}_h)\,\Pr(\mathbf{R}_h = \mathbf{r}_h \mid \mathbf{N}_h = \mathbf{n}_h),$$

where $\Pr(\mathbf{N}_h = \mathbf{n}_h)$ is the labour force flows model, and 
$\Pr(\mathbf{R}_h = \mathbf{r}_h \mid \mathbf{N}_h = \mathbf{n}_h)$ is called the nonresponse flows 
model. 

The labour force flows model is taken to be multinomial, 
with probability function 

$$\Pr(\mathbf{N}_h = \mathbf{n}_h; \boldsymbol{\omega}) = n_h! \prod_{a,b} \frac{\omega(ab)^{n_h(ab)}}{n_h(ab)!}, \qquad (1)$$

where $\omega(ab) > 0$ is the probability of an individual having 
labour force flow $(a, b)$ and $\sum_{a,b} \omega(ab) = 1$. The vector of 
labour force flows parameters is denoted by $\boldsymbol{\omega} = (\omega(11), 
\omega(21), \ldots, \omega(33))$, of which 8 are free. The assumption of 
multinomial sampling in (1) implies that individuals’ 
labour force flows behaviour is independent within 
households, and that households are homogeneous with 
respect to their labour force flows behaviour. These 
assumptions are unrealistic, but (1) can easily be extended 
to a more realistic model for the labour force flows, as we 
discuss in Section 4. 
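
For concreteness, the multinomial probability in (1) can be evaluated as in the sketch below. The function name is ours, and the flow probabilities used are those of the simulation study in Section 3.1; states 1, 2, 3 are stored at indices 0, 1, 2.

```python
from math import factorial

# Flow probabilities omega(ab) from the simulation study (rows a, columns b).
OMEGA = [[0.43, 0.245, 0.035],
         [0.02, 0.160, 0.01],
         [0.015, 0.035, 0.05]]


def flows_pmf(table, omega):
    """Multinomial probability of a 3x3 table of flow counts, as in (1)."""
    n_h = sum(sum(row) for row in table)
    coefficient = factorial(n_h)
    probability = 1.0
    for a in range(3):
        for b in range(3):
            coefficient //= factorial(table[a][b])   # stays an integer at each step
            probability *= omega[a][b] ** table[a][b]
    return coefficient * probability


# A household of two: one individual with flow (1,1), one with flow (1,2).
p_two = flows_pmf([[1, 1, 0], [0, 0, 0], [0, 0, 0]], OMEGA)
# A household of one reduces to the individual flow probability.
p_one = flows_pmf([[0, 1, 0], [0, 0, 0], [0, 0, 0]], OMEGA)
```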

The probability of household $h$ having nonresponse flow 
$(u, v)$ is taken to be 

$$\pi(uv \mid \mathbf{n}_h) = \Pr(\mathbf{R}_h = (u, v) \mid \mathbf{N}_h = \mathbf{n}_h; \boldsymbol{\psi}) = \frac{1}{n_h} \sum_{a,b} n_h(ab)\, \psi(uv \mid ab), \qquad (2)$$

for $u, v = 0, 1$, namely, a weighted average of the 
nonresponse model parameters. By setting $n_h = 1$, it can be 
seen that $\psi(uv \mid ab) > 0$ is the probability of a household of 
size one (i.e., an individual) having nonresponse flow 
$(u, v)$, given it has labour force flow $(a, b)$. Thus, 
$\sum_{u,v} \psi(uv \mid ab) = 1$ and $\boldsymbol{\psi} = (\psi(11 \mid 11), \psi(11 \mid 21), \ldots, 
\psi(00 \mid 33))$ is the vector of nonresponse parameters, of 
which 27 are free. 
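
The weighted average in (2) is straightforward to compute once the individual-level probabilities are available. In the sketch below the response probabilities by labour force state are hypothetical values chosen only for illustration, and the product form of `psi` is one possible choice rather than a form imposed by (2).

```python
# Hypothetical probabilities of responding, by labour force state, at t1 and t2.
RESPOND_T1 = [0.90, 0.70, 0.80]
RESPOND_T2 = [0.85, 0.60, 0.80]


def psi(u, v, a, b):
    """Illustrative psi(uv | ab): independent response at t1 and t2."""
    first = RESPOND_T1[a] if u == 1 else 1.0 - RESPOND_T1[a]
    second = RESPOND_T2[b] if v == 1 else 1.0 - RESPOND_T2[b]
    return first * second


def household_flow_prob(u, v, table):
    """pi(uv | n_h) in (2): average of psi weighted by the flow counts."""
    n_h = sum(sum(row) for row in table)
    total = sum(table[a][b] * psi(u, v, a, b)
                for a in range(3) for b in range(3))
    return total / n_h


# Household of two: one individual with flow (1,1), one with flow (2,2).
household = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
pi_11 = household_flow_prob(1, 1, household)
pi_all = [household_flow_prob(u, v, household) for u in (0, 1) for v in (0, 1)]
```

Because each $\psi(\cdot \mid ab)$ sums to one over $(u, v)$, so does the household-level probability, whatever the composition of the household.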

Before defining the likelihood function for the complete 
data, partition $S$ into 4 mutually exclusive and exhaustive 
subsets 

$$S = S_{11} \cup S_{10} \cup S_{01} \cup S_{00},$$

where $S_{uv} = \{h : \mathbf{r}_h = (u, v)\}$ is the subset of households 
with nonresponse flow $(u, v)$. Thus, since $S$ is a simple 
random sample of households, the likelihood function for 
the complete data is 

$$L(\boldsymbol{\omega}, \boldsymbol{\psi}; \{\mathbf{n}_h, \mathbf{r}_h\}) = \prod_{u,v} \prod_{h \in S_{uv}} L_h(\boldsymbol{\omega}, \boldsymbol{\psi}; \mathbf{n}_h, (u, v)), \qquad (3)$$

where $L_h(\boldsymbol{\omega}, \boldsymbol{\psi}; \mathbf{n}_h, (u, v))$ is the contribution of household 
$h \in S_{uv}$ to the likelihood, the product of (1) and (2). 


2.3. Model Fitting 


2.3.1 Maximum Likelihood Estimation 


Since the complete data are unavailable, (3) must be 
modified to give the likelihood based on the observed data. 
Denote the observed data by $\{\mathbf{n}_h^{*}\}$. As discussed in Section 
2.1, the observed data for households that respond at $t_1$ and 
$t_2$ is the full cross-classification in Table 1, namely, 
$\mathbf{n}_h^{*} = \mathbf{n}_h$. Similarly, if $h \in S_{10}$ then $\mathbf{n}_h^{*} = (n_h(1+), n_h(2+), 
n_h(3+))$; if $h \in S_{01}$ then $\mathbf{n}_h^{*} = (n_h(+1), n_h(+2), n_h(+3))$; and 
if $h \in S_{00}$ then $\mathbf{n}_h^{*} = n_h$. 

The contribution of household $h \in S_{uv}$ to the observed 
data likelihood is obtained by summing $L_h(\boldsymbol{\omega}, \boldsymbol{\psi}; \mathbf{n}_h, (u, v))$ 
over all possible values that the full $3 \times 3$ 
cross-classification of labour force flows can take, given the 
observed margin. Representing this set of tables by $\mathbf{n}_h : \mathbf{n}_h^{*}$, 
the observed data likelihood for $S$ is 

$$L(\boldsymbol{\omega}, \boldsymbol{\psi}; \{\mathbf{n}_h^{*}, \mathbf{r}_h\}) = \prod_{u,v} \prod_{h \in S_{uv}} \sum_{\mathbf{n}_h : \mathbf{n}_h^{*}} L_h(\boldsymbol{\omega}, \boldsymbol{\psi}; \mathbf{n}_h, (u, v)). \qquad (4)$$


Model fitting requires calculating (4) at each stage of an 
iterative optimization process. This is computationally 
intensive, because the complete data likelihood function 
must be summed explicitly over the missing data. For 
example, the observed data for $h \in S_{10}$ is $\mathbf{n}_h^{*} = (n_h(1+), 
n_h(2+), n_h(3+))$ and the contribution of this 
household to the observed data likelihood is 

$$\sum_{\mathbf{n}_h : \mathbf{n}_h^{*}} L_h(\boldsymbol{\omega}, \boldsymbol{\psi}; \mathbf{n}_h, (1, 0)).$$

To explicitly calculate this contribution, each $3 \times 3$ 
complete data table $\mathbf{n}_h$ for fixed $\mathbf{n}_h^{*}$ is generated and 
$L_h(\boldsymbol{\omega}, \boldsymbol{\psi}; \mathbf{n}_h, (1, 0))$ evaluated for each. For household size 
$n_h = 5$, there are at least 21 and at most 108 possible tables, 
depending on the values in the fixed margin; for $n_h = 15$, a 
very large household size, the respective numbers are 136 
and 9,261. A similar procedure is used for $h \in S_{01}$, except 
here $\mathbf{n}_h^{*} = (n_h(+1), n_h(+2), n_h(+3))$ is the fixed margin. If 
$h \in S_{00}$ then no data about labour force status are observed, 
only the household size $n_h$. So each $3 \times 3$ table with total $n_h$ 
must be generated, and the likelihood function calculated 
for each: for $n_h = 5$ there are 1,287 tables and for $n_h = 15$ 
there are 490,314. It is not infeasible, in terms of computer 
run-time, to calculate such sums directly. The number of 
explicit calculations can be reduced by recognising that 
each household is defined only by its observed labour force 
flows frequencies and nonresponse flow. Thus, summation 
over the missing data need only be performed once for a 
household with a particular nonresponse flow and labour 
force flows frequencies; the contribution of this household 
to the likelihood is then raised to the power of the number 
of similarly defined households in S. 
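
The table counts quoted above follow from elementary combinatorics, and can be verified with the short sketch below (function names are ours). With a fixed margin, each row (or column) total $m$ can be split over three cells in $\binom{m+2}{2}$ ways; with only the household size fixed, $n_h$ individuals are spread over nine cells in $\binom{n_h+8}{8}$ ways.

```python
from math import comb


def tables_with_margin(margin):
    """Number of 3x3 tables of counts consistent with one fixed margin."""
    count = 1
    for m in margin:
        count *= comb(m + 2, 2)   # ways to split m individuals over 3 cells
    return count


def tables_with_total(n_h):
    """Number of 3x3 tables of counts with grand total n_h."""
    return comb(n_h + 8, 8)


def margin_range(n_h):
    """Smallest and largest table counts over all margins summing to n_h."""
    counts = [tables_with_margin((i, j, n_h - i - j))
              for i in range(n_h + 1) for j in range(n_h - i + 1)]
    return min(counts), max(counts)


range_5 = margin_range(5)
range_15 = margin_range(15)
total_5 = tables_with_total(5)
total_15 = tables_with_total(15)
```

The minimum arises when all household members share one state at the observed time, and the maximum when they are spread as evenly as possible over the three states.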


2.3.2 Parameter Estimability 


If we fix $n_h = 1$ for all $h$, the complete data have no 
household structure, and form a 4-way table cross-classified 
by labour force status and nonresponse status at $t_1$ and $t_2$. 
The observed data log-likelihood (4) is now equivalent to 
that of the individual-level models in Stasny and Fienberg 
(1985), Little (1985) and Stasny (1986). For these models, 
estimability requires that the number of model parameters 
does not exceed 15 (one for each observed table cell, less 
one for the multinomial sampling constraint). Hence, 
$(\boldsymbol{\omega}, \boldsymbol{\psi})$ are inestimable because there are 8 + 27 = 35 free 
parameters. Since interest is focused on the labour force 
gross flows probabilities, $\boldsymbol{\omega}$, it is necessary to constrain $\boldsymbol{\psi}$ 
to ensure estimability. 

When $n_h > 1$, determining parameter estimability is more 
difficult, because (4) has a complicated closed-form 
expression. Fitzmaurice, Laird and Zahner (1996) use a 
numerical method to determine estimability that involves 
showing that the information matrix is non-singular in the 
neighbourhood of the maximum likelihood estimate. 
However, not only is this impractical for problems of a high 
dimension, but evaluating the information matrix for the 
household-level model is particularly difficult in this case. 
Instead, we adopt a pragmatic approach for determining 
parameter estimability: first, we restrict attention to models 
that satisfy the necessary condition for estimability when 
$n_h = 1$; and second, different starting values are used for 
each fit. If the different starting values reveal a non-unique 
maximum likelihood estimate, or any parameter estimate is 
unchanged from its starting value, then the model 
parameters are taken to be inestimable. 


2.4 Nonresponse Models 


To enable parameter estimates to be obtained from the 
observed data, the model parameters must be constrained in 
accordance with assumptions about the nonresponse mechanism. 
The nonresponse parameters are interpreted as individual 
nonresponse probabilities, but within the household 
framework established thus far, it is inappropriate to talk about 
individuals not responding. However, in reality, it is 
individuals within households that determine a household’s 
nonresponse flow, not the household itself. Therefore, 
constraints are placed on the nonresponse parameters at the 
individual level, and these apply at the household level through 
the functional dependence of $\pi(uv \mid \mathbf{n}_h)$ on $\boldsymbol{\psi}$ in (2). For 
example, if the nonresponse parameters are constrained 
such that $\psi(uv \mid ab) = \psi(uv)$ for all $a, b$, then the household 
nonresponse mechanism is ignorable, because household 
nonresponse flows are independent of the labour force 
flows. 

We now present four models for the nonresponse 
mechanism, two of which are ignorable, and two 
nonignorable. 


- Ignorable models. 
  - Model $I_A$: Constant nonresponse probability, 

    $$\psi(uv \mid ab) = \lambda^{1-u}(1-\lambda)^{u} \times \lambda^{1-v}(1-\lambda)^{v},$$

    which has 1 parameter, $\lambda$, the probability of an 
    individual not responding; 

  - Model $I_B$: Independent of labour force status, but 
    different nonresponse probabilities at $t_1$ and $t_2$, 

    $$\psi(uv \mid ab) = \lambda^{1-u}(1-\lambda)^{u} \times \theta^{1-v}(1-\theta)^{v},$$

    which has 2 parameters, $\lambda, \theta$, the probabilities of 
    nonresponse at $t_1$ and $t_2$, respectively. 

- Nonignorable models. 
  - Model $N_A$: The nonresponse distributions at $t_1$ and $t_2$ 
    are independent but depend on labour force status at $t_1$ 
    and $t_2$, respectively, 

    $$\psi(uv \mid ab) = \lambda(a)^{1-u}(1-\lambda(a))^{u} \times \theta(b)^{1-v}(1-\theta(b))^{v},$$

    which has 6 parameters, $\boldsymbol{\lambda} = (\lambda(1), \lambda(2), \lambda(3))$ and 
    $\boldsymbol{\theta} = (\theta(1), \theta(2), \theta(3))$, where $\lambda(a)$ is the probability 
    of not responding at $t_1$, given labour force status $a$ at $t_1$, 
    and $\theta(b)$ that at $t_2$, given labour force status $b$ at $t_2$. 

  - Model $N_B$: The nonresponse distributions at $t_1$ and $t_2$ 
    depend on labour force status at $t_1$ and $t_2$ 
    respectively, i.e., a first-order Markov process. Unlike $N_A$, 
    the nonresponse distributions at $t_1$ and $t_2$ are 
    dependent: if the nonresponse status at $t_1$ is 1, then 
    the nonresponse distribution at $t_2$ is the same as at 
    $t_1$; but if the nonresponse status at $t_1$ is 0, the 
    nonresponse distributions are distinct, 

    $$\psi(uv \mid ab) = \lambda(a)^{1-u}(1-\lambda(a))^{u} \times \begin{cases} \lambda(b)^{1-v}(1-\lambda(b))^{v}, & \text{if } u = 1, \\ \theta(b)^{1-v}(1-\theta(b))^{v}, & \text{if } u = 0, \end{cases}$$

for $a, b = 1, 2, 3$ and $u, v = 0, 1$. Under model $I_A$, there are 
a total of 8 + 1 = 9 free parameters, satisfying the necessary 
condition for estimability of an individual-level model. 
Models $I_B$, $N_A$ and $N_B$ have 10, 14 and 14 free parameters, 
respectively, and so also satisfy the necessary condition for 
estimability. 
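
The four nonresponse models can be written as small functions, which makes the parameter counts easy to check against the bound of 15. This is a sketch under our reading of the models: a status of 1 means the household responds, $\lambda$ and $\theta$ are probabilities of not responding, and all numerical values are hypothetical.

```python
def status_prob(p_nonresponse, status):
    """Probability of a response status (1 = responds, 0 = does not)."""
    return 1.0 - p_nonresponse if status == 1 else p_nonresponse


def psi_ia(u, v, a, b, lam):
    return status_prob(lam, u) * status_prob(lam, v)


def psi_ib(u, v, a, b, lam, theta):
    return status_prob(lam, u) * status_prob(theta, v)


def psi_na(u, v, a, b, lam, theta):
    # lam and theta are sequences indexed by labour force state (0, 1, 2).
    return status_prob(lam[a], u) * status_prob(theta[b], v)


def psi_nb(u, v, a, b, lam, theta):
    # The distribution at t2 depends on the response status at t1.
    second = lam[b] if u == 1 else theta[b]
    return status_prob(lam[a], u) * status_prob(second, v)


# Free parameters: 8 flow probabilities plus the nonresponse parameters.
FREE_PARAMETERS = {"I_A": 8 + 1, "I_B": 8 + 2, "N_A": 8 + 6, "N_B": 8 + 6}

# Hypothetical values, for illustration only.
LAM = [0.10, 0.30, 0.20]
THETA = [0.15, 0.35, 0.25]
FLOWS = [(u, v) for u in (0, 1) for v in (0, 1)]
total_nb = sum(psi_nb(u, v, 0, 1, LAM, THETA) for u, v in FLOWS)
```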


3. SIMULATION STUDY 


3.1 Simulation Procedure 


We used a simulation study to investigate the 
consequences of failing to account for the household structure of 
household-based survey data, and to compare labour force 
gross flows estimates for individual-level and 
household-level models. For this purpose, household-based survey 
data were generated using Monte Carlo sampling. Each 
sample data set consisted of 10,000 individuals arranged 
into households of size $n_h = k$ for all $h$. Within each 
household, labour force flows were generated from (1), and 
the nonresponse flow was generated from (2), under one of 
models $N_A$ or $N_B$. The data were made incomplete by 
collapsing each complete labour force flows data table to 
be consistent with the household nonresponse flow. In 
total, 1,000 independent data sets were generated in this 
way. 
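
A minimal sketch of this generating procedure is given below. It is not the code used for the study: the nonresponse probabilities `LAM` and `THETA` are hypothetical, the nonresponse model is a product form depending on status at each time, and the function names are ours. The flow probabilities are those listed in the table that follows.

```python
import random

OMEGA = [[0.43, 0.245, 0.035],
         [0.02, 0.160, 0.01],
         [0.015, 0.035, 0.05]]
# Hypothetical probabilities of not responding, by labour force state.
LAM = [0.10, 0.30, 0.20]
THETA = [0.15, 0.35, 0.25]
CELLS = [(a, b) for a in range(3) for b in range(3)]
FLOWS = [(1, 1), (1, 0), (0, 1), (0, 0)]


def psi(u, v, a, b):
    """Individual-level nonresponse flow probability (assumed product form)."""
    first = 1.0 - LAM[a] if u == 1 else LAM[a]
    second = 1.0 - THETA[b] if v == 1 else THETA[b]
    return first * second


def simulate_household(k, rng):
    """Draw flows from (1), a nonresponse flow from (2), then collapse."""
    table = [[0] * 3 for _ in range(3)]
    weights = [OMEGA[a][b] for a, b in CELLS]
    for a, b in rng.choices(CELLS, weights=weights, k=k):
        table[a][b] += 1
    probs = [sum(table[a][b] * psi(u, v, a, b) for a, b in CELLS) / k
             for u, v in FLOWS]
    u, v = rng.choices(FLOWS, weights=probs, k=1)[0]
    if (u, v) == (1, 1):
        observed = table
    elif (u, v) == (1, 0):
        observed = [sum(row) for row in table]
    elif (u, v) == (0, 1):
        observed = [sum(col) for col in zip(*table)]
    else:
        observed = k
    return (u, v), observed


def simulate_dataset(n_individuals=10000, k=5, seed=0):
    rng = random.Random(seed)
    return [simulate_household(k, rng) for _ in range(n_individuals // k)]


dataset = simulate_dataset(n_individuals=1000, k=5, seed=1)
```

Repeating `simulate_dataset` with different seeds would give the independent replicates over which sampling distributions of the estimates are summarised.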

The population parameters used to generate the labour 
force flows are shown in the following table: 

                  b
          1       2       3
     1  0.43    0.245   0.035
a    2  0.02    0.160   0.01
     3  0.015   0.035   0.05


This is clearly a population in recession, since the 
probability of moving from being employed to unemployed 
is very large ($\omega(12) = 0.245$). Under models $N_A$ and $N_B$, 
the population parameters are 


It should be noted that these parameter values do not 
represent realistic nonresponse flows behaviour; they were 
chosen for the purpose of illustrating this methodology. 
However, this does not affect the general conclusions of the 
paper, which are also relevant for realistic values of the true 
nonresponse probabilities. 


3.2 Simulation Results 


Estimates for individual-level models are obtained by 
fitting (4) with $n_h = 1$ to each incomplete data set. Figure 1 
summarises the sampling distributions of the individual-level 
maximum likelihood estimate of $\omega(12)$, $\hat{\omega}(12)$, for 
nonresponse models $I_A$, $I_B$, $N_A$ and $N_B$ (estimates for 
ignorable models $I_A$ and $I_B$ are included together, because 
both yield the same estimates of the labour force flows). 
The vertical lines represent the intervals between the 
2.5-percentile and the 97.5-percentile of each estimate’s 
sampling distribution, and the bold point represents its 
median. There are three distributions obtained for each 
individual-level estimate: the left-most distribution is that 
when the household size is $k = 1$, i.e., the simulated data 
have no household structure; and reading from left to right, 
the next two distributions are those obtained when the 
household size is $k = 2$ and $k = 5$, respectively. The solid 
horizontal line denotes the true flow probability, $\omega(12) = 
0.245$. The behaviour of the sampling distribution of 
$\hat{\omega}(12)$ in this study reflects that of the other labour force 
gross flows estimates. 

Figure 1a summarises the sampling distributions when 
$N_A$ is the true model. If the fitted individual-level model is 
$I_A$, $I_B$ or $N_B$, the labour force gross flows estimates have 
large biases, whatever the household size. As would be 
expected, the median estimate for correct model $N_A$ is 
unbiased if k = 1, and a small bias is apparent for k = 2 and 
k = 5 (although this bias is smaller for k = 5 than k = 2). 
Bias reduction with increasing k is also apparent for 
individual-level estimates $I_A$, $I_B$ and $N_B$. This behaviour 
is unexpected, since it seems natural to expect the bias of 
the individual-level estimates to increase with the household 
size. The results are slightly different in Figure 1b 
when $N_B$ is true. Here the estimate for individual-level 
model $N_B$ becomes more biased as k increases, but the bias 
decreases for misspecified individual-level models $I_A$, $I_B$ 
and $N_A$. Furthermore, the misspecified estimates for $I_A$ and 
$I_B$ have a small bias when compared to those for 
misspecified model $N_A$. These results are discussed in 
Section 3.3. 


[Figure 1: two panels, a) and b); within each panel the estimates are 
grouped as "IA and IB", "NA" and "NB", with k = 1, 2, 5 in each group.] 

Figure 1. Sampling Distribution of $\hat{\omega}(12)$ for Individual-Level 
Models $I_A$, $I_B$, $N_A$ and $N_B$ When the True Nonresponse 
Model is a) $N_A$ and b) $N_B$ and the Household Size is 
k = 1, 2, 5. 


A comparison of the median estimates of $\hat{\omega}(12)$ for the 
fitted individual-level and household-level models when 
N,, is true, is presented in Figure 2. There are four sampling 
distributions associated with each model: the first two 
represent those from fitting an individual-level nonresponse 
model, and a household-level nonresponse model, when the 
household size is k=2; and similarly, the next two 
distributions are those when the household size is 5. 
For a particular pair of individual-level and household-level 
sampling distributions, it can be seen that the 
household-level estimate is less biased than its equivalent 
individual-level estimate, and the spread of each household-level 
sampling distribution is narrower. The exception to 
this is when fitting an ignorable model ($I_A$ or $I_B$), where the 
household-level and individual-level distributions are identical. This 
equality occurs because the observed data likelihoods for the 
individual-level and household-level models are equivalent 
when the nonresponse model is ignorable. Another feature 
is that, if the nonresponse model is correctly specified, the 
household-level estimates are unbiased. 


[Figure 2: estimates grouped as "IA and IB", "NA" and "NB".] 

Figure 2. Sampling Distributions of $\hat{\omega}(12)$ for Individual-Level 
and Household-Level Models $I_A$, $I_B$, $N_A$ and $N_B$ When 
the True Nonresponse Model is N, and the Household 
Size is k = 2, 5. 


3.3 Summary 


The estimates of the labour force gross flows under 
individual-level models, are never less biased than those of 
household-level models, when fitted to household-based 
survey data in our study. It should be noted, that if the true 
model is ignorable, it is unnecessary to utilise a household- 
level nonresponse model, because the individual-level and 
household-level models are equivalent. For example, if J, 
is true, (2) reduces to 4“*"(1 - 4)!"""”, and (4) factorizes 
into two components, dependent on @ only and A only; the 
factor dependent on @ can be shown to be equivalent to 
that for the individual-level model, and thus the labour force 
flows estimates are the same. 

It appears, as the household size increases, that the bias 
of the labour force flows estimates decreases, if the true 
model is nonignorable. In fact, this result arises because we 
use (1) to generate the labour force flows, and not because 
the model estimates are unbiased for large $n_h$. To see why, 
consider the household formation process used to generate 
each Monte Carlo sample: as $n_h$ increases, each household 
frequency tends to the same value, i.e., $n_h(ab)$ converges to 
$n_h\omega(ab)$; hence, 


$$\pi(uv \mid \mathbf{n}_h) \;\to\; \frac{1}{n_h} \sum_{a,b} n_h\,\omega(ab)\,\psi(uv \mid ab) = \sum_{a,b} \omega(ab)\,\psi(uv \mid ab),$$

which is independent of $\mathbf{n}_h$, that is, the simulated 
household nonresponse mechanism is ignorable. Therefore, 
the labour force flows estimates are unbiased, because 
fitting the nonignorable models to the simulated data, yields 
parameter estimates that are consistent with ignorable 
nonresponse. To generate nonignorable household-level 
nonresponse, it is necessary to prevent $n_h(ab) \to n_h\omega(ab)$, 
by extending (1), to allow for differential labour force flows 
between households. Such extensions to the labour force 
flows model are discussed in Section 4. 
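
The convergence argument above can be illustrated numerically. The following is only a sketch under stated assumptions: household flow counts are drawn from a multinomial distribution with the flow probabilities of Section 3.1, and the table `resp` of response probabilities by flow is hypothetical (its values are invented for illustration and are not taken from the paper). It shows that the household-level response probability concentrates around a constant as the household size grows, which is what makes the simulated mechanism behave as if it were ignorable.

```python
import numpy as np

# Flow probabilities omega(ab) from Section 3.1 (rows a, columns b).
omega = np.array([[0.430, 0.245, 0.035],
                  [0.020, 0.160, 0.010],
                  [0.015, 0.035, 0.050]])

# HYPOTHETICAL response probabilities by flow (a, b); illustration only.
resp = np.array([[0.9, 0.6, 0.8],
                 [0.7, 0.5, 0.7],
                 [0.8, 0.6, 0.9]])

def household_response_probs(k, n_households=5000, seed=0):
    """Average response probability within each simulated household of size k."""
    rng = np.random.default_rng(seed)
    # Each row holds the flow counts n_h(ab) of one household.
    counts = rng.multinomial(k, omega.ravel(), size=n_households)
    return counts @ resp.ravel() / k

# Limiting value: sum over (a, b) of omega(ab) times the response probability.
limit = float((omega * resp).sum())

# Between-household spread of the response probability for several sizes.
spread = {k: float(household_response_probs(k).std()) for k in (1, 5, 50)}
print(limit, spread)
```

The spread shrinks roughly like one over the square root of the household size, so for large households every household has nearly the same response probability, whatever its realised flows.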

Figure 1b) shows two anomalous results that contradict 
the above explanation, when $N_B$ is the true model. First, 
the bias of the individual-level model $N_B$ estimate increases 
as $n_h$ increases. However, further simulations with 
household size $n_h = 10$ revealed that the individual-level 
estimate bias is zero. Thus, asymptotic ignorable nonresponse 
is also evident when $N_B$ is true, but $n_h$ must be 
large before its effect becomes apparent for individual-level 
model $N_B$. Second, the bias of the ignorable individual-level 
model estimates is small, almost zero, when $N_B$ is 
true. This small bias reduces even further as $n_h$ increases, 
in line with asymptotic ignorability, but we have yet to 
arrive at a satisfactory explanation as to why the ignorable 
models perform so well in this situation. Further study is 
necessary to investigate this finding. 


4. DISCUSSION 


In Sections 3 and 4, it is demonstrated by means of a 
simulation based study, that modelling household-level 
nonignorable nonresponse, when estimating labour force 
gross flows from household-based surveys, leads to reduced 
bias in the flows estimates, compared to those from 
individual-level models. If the nonresponse model is 
ignorable, it is unnecessary to use household-level models, 
because the individual-level and household-level models 
are equivalent. Furthermore, it is shown that controlling for 
household-level nonresponse does not necessarily remove 
all bias from the estimates of the labour force flows. 
Correct specification of the nonresponse model is still seen 
to be imperative, although taking the household structure of 
the data into account, may lead to a refinement of the flows 
estimates if the nonresponse model is misspecified. In 
particular, we show that household-level estimates are less 
biased than their equivalent individual-level estimates. 
Our nonresponse model is an extension of the idea that 
nonresponse can depend upon the characteristics of a unit, 
in this case, the labour force flows of household members. 
Nonresponse in household-based surveys can occur for 
more than one reason, e.g., refusal, non-contact, moving 
house or sample rotation. The current model can easily be 
extended to model more complex nonresponse patterns, by 
specifying the nonresponse indicator as a polytomous 
variable, and parameterizing the nonresponse model in 
accordance with the complex nonresponse patterns. It 
should also be noted, that we do not assume that the 
household-level model is an accurate representation of 
household nonresponse behaviour; rather, we assume that 
the household-level model, offers an approximation of 
within-household nonresponse dynamics. 

An important problem, highlighted by the results from 
the simulation study, is our assumption that individual 
labour force flows behaviour is homogeneous within 
households. Clearly, this is an unrealistic assumption. The 
model is easily extended, by specifying the labour force 
flows and nonresponse flows probabilities, as regression 
models to accommodate individual-level, household-level, 
or higher level covariate information. For example, the 
labour force flows probabilities could be specified as a 
multinomial-logistic regression: 


$$\log \frac{\omega_{hi}(ab)}{\omega_{hi}(11)} = \beta_0^{(ab)} + x_{hi}\,\beta_1^{(ab)},$$


where $\omega_{hi}(ab)$ denotes the probability of individual $i$ in 
household $h$ making labour force flow $(a, b)$, $x_{hi}$ is a (row) 
vector of covariates, and $(\beta_0^{(ab)}, \beta_1^{(ab)})$ are the regression 
coefficients for multinomial-logit $(a, b)$. However, fitting 
these models requires conditional independence assump- 
tions to be made, about the relationship between the 
distributions of the covariates, the labour force flows and 
the nonresponse flows, because the covariate information 
may be missing for nonresponding households. An 
alternative solution, is to allow for heterogeneous between 
household labour force flows, using random effects, by 
making assumptions about the distribution of between 
household differences. Fitting these models is also 
complicated and would require, for example, a Markov 
chain Monte Carlo procedure to perform the necessary 
integration. If $S$ is not a simple random sample, auxiliary 
design variables can be incorporated into the fitting process, 
using the regression framework just described. 
Longitudinal Analysis of Swiss Labour Force Survey Data 
by Multivariate Logistic Regression 


PAUL-ANDRE SALAMIN 


ABSTRACT 


In longitudinal surveys, simple estimates of change, such as differences of percentages may not always be efficient enough 
to detect changes of practical relevance, especially in sub-populations. The use of models, which can represent the 
dependence structure of the longitudinal survey, can help to solve this problem. One of the main characteristics observed 
by the Swiss Labour Force Survey (SLFS) is the employment status. As the survey is designed as a rotating panel, the data 
from the SLFS are multivariate categorical data, where a large proportion of the response profiles are missing by design. 
The multivariate logistic model, introduced by Glonek and McCullagh (1995) as a generalisation of logistic regression, is 
attractive in this context, since it allows for dependent repeated observations and incomplete response profiles. We show 
that, using multivariate logistic regression, we can represent the complex dependence structure of the SLFS by a small 
number of parameters, and obtain more efficient estimates of change. 


KEY WORDS: Longitudinal binary data; Multivariate logistic model; Labour force survey. 


1. INTRODUCTION 


One of the main objectives of the Swiss Labour Force 
Survey (SLFS) is to produce estimates of change for the 
percentages of the population in different employment 
statuses. Typically, simple estimates of change, such as the 
difference of the percentages of employed individuals 
between two years, are calculated for the whole population, 
and for a large number of sub-populations. In general, this 
is unsatisfactory, as the estimates for the sub-populations 
may not always be efficient enough to detect changes of 
practical relevance. The work presented here was motivated 
by the question, whether the use of models, which can 
represent the dependence structure of the survey, could help 
to solve this problem. 

As the SLFS is designed as a rotating panel, we are 
dealing with longitudinal categorical data, for which a fairly 
large proportion of the response profiles, are incomplete by 
design. The focus of interest is on modelling marginal 
probabilities, namely, the probabilities to be in a given 
employment status, as a function of time and other 
covariates that define sub-populations. If the repeated 
observations of the employment status were independent, a 
natural approach would be to use logistic regression. The 
multivariate logistic model, introduced by Glonek and 
McCullagh (1995) as a generalisation of logistic regression, 
is attractive in this context, since it allows for dependent 
repeated observations and incomplete response profiles. 

The aim of this paper is to show that, the ability of 
multivariate logistic regression to model the complex 
dependence structure of the SLFS data, leads to more 
efficient estimators of change. Although we illustrate the 
method using the SLFS data only, it is clearly of wider 
applicability. 




There are a number of important issues that are not dealt 
with in this paper. As the SLFS data come from a complex 
survey, it can be argued that any analysis should take the 
sampling weights into account (Pfeffermann 1993). Here 
we use the unweighted data only. However, it can be 
shown, using the pseudo-likelihood approach of Binder 
(1983), that multivariate logistic regression can be extended 
to that situation (Salamin 1998). Non-response is always 
of great concern in sample surveys. Here, we consider only 
the incomplete response profiles that arise through the 
rotation of the panel, in which case, the hypothesis of 
missing completely at random is reasonable. Note, 
however, that multivariate logistic regression is flexible 
enough to incorporate extra parameters for the incomplete 
profiles arising from panel attrition. Thus, the individuals 
who dropped out of the panel could also have been 
included in the analysis. Finally, it is well known that 
classification errors may introduce large biases in the 
observed response profile probabilities, see e.g., 
Pfeffermann, Skinner and Keith (1998). It would certainly 
be desirable to investigate how these biases affect the 
parameter estimates of multivariate logistic regression, 
which have interpretations in terms of marginal moments. 

Log-linear models and marginal models are closely 
related to multivariate logistic regression, and are further 
discussed in Section 3. Here we discuss briefly transition 
models, random effects models, and survival analysis, in the 
context of the SLFS. Under a transition model, see e.g., 
Diggle, Liang and Zeger (1994, Ch. 10) or Zeger and Liang 
(1992), the repeated observations of the employment status 
are correlated, because past employment statuses influence 
the present employment status. The focus of interest is 
the transition probabilities between the different 
employment statuses, e.g., the probability of being 
employed, conditional on being unemployed in the past. In 
the regression setting, the past responses are treated as 
additional explanatory variables. An important issue is the 
determination of the number of past responses to include as 
predictors. If the model for the transition probabilities is 
correctly specified, we can treat the repeated transitions for 
an individual as independent events, and use standard 
statistical methods, such as logistic regression. Under a 
random effect model, see e.g., Diggle et al. (1994, Ch. 9), 
the probability of being in a given employment status, is a 
function of explanatory variables, where the regression 
coefficients vary from one individual to the next. This 
variability of the regression coefficients, reflects the natural 
heterogeneity of the individuals, due to unmeasured factors. 
Given the regression coefficients, the repeated observations 
of the employment status, are assumed to be independent. 
The correlation among the repeated observations, arises 
solely because we are unable to observe the true regression 
coefficients. This approach is most useful, when inference 
about individuals rather than population averages, is the 
focus of interest. In survival analysis, also called event 
history analysis in the econometric literature (Lancaster 
1990), the focus is on modelling the transitions between 
employment statuses over time, as a function of explanatory 
variables. Here, the exact time at which a transition takes 
place, is important. In the SLFS, the employment status is 
observed once a year. The changes in employment status, 
that took place during the year preceding the interview, can 
be reconstructed. However, since this reconstruction is 
based on the self-assessment of the subjects, there may be 
some imprecision as regards prior status, and time of 
change of status. An analysis of the SLFS data based on this 
approach can be found in Gerfin (1996). 

The article is organized as follows. We begin in Section 
2 by describing the data, a subset of about 5000 individuals 
from the SLFS, which are used in the examples of Sections 
4 and 5. In Section 3, we discuss multivariate logistic 
regression, and contrast it with the log-linear and marginal 
models. In Section 4, we illustrate the ability of multivariate 
logistic regression, to represent the complex dependence 
structure of the SLFS data, by a small number of para- 
meters. In Section 5, we compare multivariate logistic 
regression with a simple estimator of change. It is shown 
that, using multivariate logistic regression, results in a gain 
in efficiency. Finally, we present in Section 6 our 
conclusions, and give directions for further work. 


2. SWISS LABOUR FORCE SURVEY DATA 


A detailed description of the sampling design and 
weighting procedure of the SLFS, can be found in Hulliger, 
Ries, Comment and Bender (1997). Here, we just recall 
some of the relevant aspects of this survey. The SLFS 
collects information on the employment of resident persons 
of age 15 or more in Switzerland. Starting in the second 
quarter of 1991, a sample of about 16,000 persons is 
interviewed each year. The survey is designed as a rotating 
panel, with a time-in-sample of 5 years. During the start-up 
phase, i.e., from 1992 to 1996, approximately one fifth of 
the original sample was rotated out each year, and replaced 
by a renewal sample. The units in the renewal samples then 
stayed in the panel for a full period of 5 years. 

In the examples of Sections 4 and 5, we use the obser- 
vations of the employment status, for the years 1992 to 
1995, obtained from the individuals in the sample, of the 
canton of Vaud. The structure of the data, as well as the 
longitudinal and cross-sectional sample sizes, are shown in 
Table 1. Due to the sampling design, some of the response 
profiles are incomplete. For example, for the individuals 
that were selected in 1991 and rotated out of the sample in 
1994, the period of observation, denoted (1)234, goes from 
1991 to 1994. We use the notation (1)234, to emphasise the 
fact, that we do not use the observations taken in 1991. 


Table 1 
Structure of the Data, Longitudinal and Cross-sectional Sample 
Sizes, Canton of Vaud, 1992-1995 


First year    Observation times for various    Period of      Longitudinal 
in sample     parts of the sample              observation    sample size 

91            92                               (1)2             622 
91            92  93                           (1)23            412 
91            92  93  94                       (1)234           527 
91            92  93  94  95                   (1)2345          481 
92            92  93  94  95                   2345             612 
93                93  94  95                   345              722 
94                    94  95                   45               728 
95                        95                   5                877 

Cross-sectional sample sizes and total: 
              92      93      94      95       Total 
              2,654   2,754   3,070   3,420    4,981 
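
As a consistency check on Table 1, the size of any one longitudinal group can be recovered from the overall total, and each cross-sectional size is the sum of the groups observed in that year. The short sketch below does this for the 345 group; the dictionary layout and the helper name `years_observed` are illustrative choices, not part of the survey documentation.

```python
# Longitudinal sample sizes by period of observation (Table 1).
# The 345 group is left as None and recovered from the overall total.
sizes = {"(1)2": 622, "(1)23": 412, "(1)234": 527, "(1)2345": 481,
         "2345": 612, "345": None, "45": 728, "5": 877}
total = 4981
# Cross-sectional sizes for 1992..1995, keyed by the last digit of the year.
cross_sectional = {2: 2654, 3: 2754, 4: 3070, 5: 3420}

sizes["345"] = total - sum(n for n in sizes.values() if n is not None)

def years_observed(period):
    """Years covered by a period label, ignoring the unused 1991 wave '(1)'."""
    return {int(ch) for ch in period.replace("(1)", "")}

# Sum the groups that contribute an observation in each year.
recomputed = {year: sum(n for p, n in sizes.items() if year in years_observed(p))
              for year in cross_sectional}
print(sizes["345"], recomputed)
```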


Employment status is a nominal variable with three 
categories, defined as "employed", "unemployed" and "out 
of the labour force". In the examples of Sections 4 and 5, we 
work with a binary variable, taking the value 1 if an 
individual is employed, and 2 if an individual is 
unemployed or out of the labour force. This is done solely 
to simplify the presentation of the multivariate logistic 
models. As the method can handle an arbitrary number of 
categories, it would be preferable, not to collapse the 
statuses in a real analysis. Caution must be exercised, if it 
is nevertheless necessary to combine some of the statuses, 
as heterogeneity of the statuses may introduce bias. 


3. MULTIVARIATE LOGISTIC MODELS 


The multivariate logistic model, introduced by Glonek 
and McCullagh (1995), can handle multivariate responses 
of either nominal or ordinal types, and either discrete or 
continuous explanatory variables. Here, we consider only 
multivariate binary responses and discrete predictors. The 
multivariate logistic model is an example of a generalized 
linear model, see McCullagh and Nelder (1989). Its link 
function, also called the multivariate logistic transforma- 
tion, expresses the joint distribution of the response 
profiles, in terms of marginal moments of increasing order, 
the first two being marginal logits, and marginal log odds 
ratios. The link function has the property, termed 
reproducibility, that a multivariate logistic model, applies to 
any subset of the response vector. This property ensures 
that, the interpretations of the parameters are the same, 
regardless of the number of response variables, and whether 
or not higher order parameters are included. This makes 
multivariate logistic regression, especially attractive for the 
analysis of longitudinal data, where the repeated observa- 
tions of an outcome arise on an equal footing, and where 
the number of repeated observations may vary from one 
individual to the next. Reproducibility is also the key to the 
ability of the model, to accommodate observations with 
incomplete responses. Note however, that we need to 
assume, that the data are missing completely at random, if 
the same parameters are to be used to model the complete 
and incomplete response profiles. The parameter estimates 
are found by maximum likelihood. A key step, is the 
inversion of the multivariate logistic transformation. For 
more than three responses, this may not always be possible, 
as there are then constraints among the parameters (Glonek 
and McCullagh 1995; Liang, Zeger and Qaqish 1992). 
Also, the presence of empty cells, may limit the order of the 
parameters that can be fitted. 

The log-linear model is widely used to model multi- 
variate binary data. In the saturated log-linear model, see 
e.g., Liang et al. (1992), the canonical parameter associated 
with a subset of the variables, has an interpretation in terms 
of conditional probabilities given the rest of the variables, 
e.g., the first and second order parameters are logits and log 
odds ratios, conditional on all the other responses. It follows 
that, the log-linear model is not reproducible, which makes 
it less preferable than multivariate logistic regression, for 
the analysis of longitudinal data. It is nevertheless possible, 
to build log-linear models that, as in the multivariate 
logistic model, have marginal logits as parameters. This 
leads to the marginal models (Diggle et al. 1994, Ch. 8). 
In these models, the dependence of the marginal proba- 
bilities on explanatory variables, is modelled separately 
from within-unit correlation. Under this approach, the 
parameters are not estimated by maximum likelihood. 
Rather, only the structure of the correlation, between the 
repeated observations of an outcome is specified, and the 
parameters are estimated by solving generalized estimating 
equations (GEE), a multivariate analogue of quasi- 
likelihood (McCullagh and Nelder 1989). A number of 
specifications of the correlation structure have been 
proposed, for example Liang et al. (1992) use the marginal 
log odds ratios, as in Glonek and McCullagh (1995). We 
have made some comparisons between multivariate logistic 
regression and PROC GENMOD of SAS (release 6.12). 
This procedure has the ability to fit correlated response 
models by the GEE method. We found very similar 
estimates of the marginal logits. The GEE method appeared 
to be slightly less efficient than multivariate logistic 
regression. A limitation of the GEE method is that, it cannot 
yield estimates of the response profile probabilities, but 
only of the marginal probabilities. By contrast, the multi- 
variate logistic model does not have this limitation, since its 
parameters are estimated by maximum likelihood. 
Following Glonek and McCullagh (1995), we discuss in 
Section 3.1 the multivariate logistic transformation, and we 
give, in Section 3.2, the algorithm for maximum likelihood. 


3.1 Multivariate Logistic Transformation 
Let $Y_1, Y_2, \ldots, Y_d$ be $d$ repeated observations, taken at 
times $t_1 < t_2 < \cdots < t_d$, of the same binary variable, and let 

$$\pi_{i_1 i_2 \ldots i_d} = P(Y_1 = i_1, Y_2 = i_2, \ldots, Y_d = i_d),$$

where $i_1, i_2, \ldots, i_d$ are all either 1 or 2, be the joint 
probabilities of the random variables $Y_1, Y_2, \ldots, Y_d$. In the 
multivariate logistic model, the joint probabilities of 
$Y_1, Y_2, \ldots, Y_d$ are parameterized in terms of marginal logits, 
marginal log odds ratios, and contrasts of marginal log odds 
ratios. This parameterization can be written as 
$\eta = C^T \log(L\pi)$, where $\pi$ is the vector of dimension $q = 2^d$ 

$$\pi = (\pi_{11\ldots11}, \pi_{11\ldots12}, \ldots, \pi_{22\ldots21}, \pi_{22\ldots22})^T,$$

and where the matrices $L$ and $C$ are tensor products of 
suitably chosen marginal indicator and contrast matrices 
respectively. The matrices $L$ and $C$, which depend on the 
length $d$ of the observation period, are defined recursively, 
beginning with $L_0 = C_0 = 1$, as 

$$L_d = \begin{pmatrix} L_{d-1} \otimes 1_2^T \\ L_{d-1} \otimes I_2 \end{pmatrix}$$

and 

$$C_d = \begin{pmatrix} C_{d-1} & 0 \\ 0 & C_{d-1} \otimes C \end{pmatrix},$$

where $1_2 = (1, 1)^T$, $I_2$ is the two by two identity matrix, and 
$C = (1, -1)^T$ (Glonek and McCullagh 1995). 

To illustrate matters, we consider periods of observation 
of length $d = 1, 2, 3, 4$. For $d = 1$, $\pi = (\pi_1, \pi_2)^T$ and 
$\eta = (\eta_0, \eta_1)^T = (\log \pi_+, \operatorname{logit} Y_1)^T$, where the plus subscript 
indicates summation, and $\operatorname{logit} Y_1$ is defined as 

$$\operatorname{logit} Y_1 = \log \frac{P(Y_1 = 1)}{P(Y_1 = 2)} = \log \frac{\pi_1}{\pi_2}.$$

In that case the multivariate logistic transformation is 
equivalent to the usual logistic transformation. Note that, 
although the parameter $\eta_0 = \log \pi_+ = 0$ is strictly superfluous, 
it is convenient to retain it, as a means of ensuring 
that the mapping $\pi \to \eta$ is of full rank, and also expressing 
the requirement that $\pi_+ = 1$. 


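For $d = 2$, $\pi = (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})^T$, and 

$$\eta = (\eta_0, \eta_1, \eta_2, \eta_{12})^T = (\log \pi_{++}, \operatorname{logit} Y_1, \operatorname{logit} Y_2, \log \mathrm{OR}(Y_1, Y_2))^T,$$

where 

$$\mathrm{OR}(Y_1, Y_2) = \frac{P(Y_1 = 1, Y_2 = 1)\,P(Y_1 = 2, Y_2 = 2)}{P(Y_1 = 1, Y_2 = 2)\,P(Y_1 = 2, Y_2 = 1)}$$

is the odds ratio, a quantity measuring the association 
between the variables $Y_1$ and $Y_2$. The parameters $\eta_1$ and $\eta_2$ 
are the marginal logits at times $t_1$ and $t_2$, for example 

$$\eta_1 = \operatorname{logit} Y_1 = \log \frac{\pi_{1+}}{\pi_{2+}}.$$

For $d = 3$, $\pi = (\pi_{111}, \pi_{112}, \ldots, \pi_{221}, \pi_{222})^T$ and 

$$\eta = (\eta_0, \eta_1, \eta_2, \eta_{12}, \eta_3, \eta_{13}, \eta_{23}, \eta_{123})^T.$$

The parameters $\eta_1$, $\eta_2$ and $\eta_3$ are the marginal logits at 
times $t_1$, $t_2$ and $t_3$. The parameters $\eta_{12}$, $\eta_{13}$ and $\eta_{23}$ are the 
log odds ratios of the corresponding two-dimensional 
marginal tables, for example 

$$\eta_{12} = \log \mathrm{OR}(Y_1, Y_2) = \log \frac{\pi_{11+}\,\pi_{22+}}{\pi_{12+}\,\pi_{21+}}.$$

The parameter $\eta_{123}$ is a contrast of log odds ratios given by 

$$\eta_{123} = \log \mathrm{OR}(Y_1, Y_2 \mid Y_3 = 1) - \log \mathrm{OR}(Y_1, Y_2 \mid Y_3 = 2) = \log \frac{\pi_{111}\,\pi_{221}}{\pi_{121}\,\pi_{211}} - \log \frac{\pi_{112}\,\pi_{222}}{\pi_{122}\,\pi_{212}}.$$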
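For $d = 4$, $\pi = (\pi_{1111}, \pi_{1112}, \ldots, \pi_{2221}, \pi_{2222})^T$ and 

$$\eta = (\eta_0, \eta_1, \eta_2, \eta_{12}, \eta_3, \eta_{13}, \eta_{23}, \eta_{123}, \eta_4, \eta_{14}, \eta_{24}, \eta_{124}, \eta_{34}, \eta_{134}, \eta_{234}, \eta_{1234})^T.$$

The parameters $\eta_i$, $\eta_{ij}$ and $\eta_{ijk}$, where $1 \le i < j < k \le 4$, are 
defined as above, using the appropriate marginal tables. The 
parameter $\eta_{1234}$ is a contrast of log odds ratios given by 

$$\begin{aligned}
\eta_{1234} = {} & \log \mathrm{OR}(Y_1, Y_2 \mid Y_3 = 1, Y_4 = 1) \\
& - \log \mathrm{OR}(Y_1, Y_2 \mid Y_3 = 1, Y_4 = 2) \\
& - \log \mathrm{OR}(Y_1, Y_2 \mid Y_3 = 2, Y_4 = 1) \\
& + \log \mathrm{OR}(Y_1, Y_2 \mid Y_3 = 2, Y_4 = 2).
\end{aligned}$$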
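
The recursion for L and C and the parameters illustrated for small d can be checked numerically. The sketch below is an assumption-laden illustration, not the author's software: it builds the matrices with Kronecker products, taking the joint probabilities in the order in which the last index varies fastest, and the function names `build_L_C` and `mlt` are invented for this example.

```python
import numpy as np

def build_L_C(d):
    """Marginal indicator matrix L_d and contrast matrix C_d (L_0 = C_0 = 1)."""
    ones_row = np.ones((1, 2))        # 1_2^T
    I2 = np.eye(2)
    c = np.array([[1.0], [-1.0]])     # contrast vector C = (1, -1)^T
    L = np.ones((1, 1))
    C = np.ones((1, 1))
    for _ in range(d):
        L = np.vstack([np.kron(L, ones_row), np.kron(L, I2)])
        Ck = np.kron(C, c)
        C = np.block([[C, np.zeros((C.shape[0], Ck.shape[1]))],
                      [np.zeros((Ck.shape[0], C.shape[1])), Ck]])
    return L, C

def mlt(pi, d):
    """Multivariate logistic transformation eta = C^T log(L pi)."""
    L, C = build_L_C(d)
    return C.T @ np.log(L @ np.asarray(pi, dtype=float))

# d = 2: eta = (log pi_++, logit Y1, logit Y2, log OR(Y1, Y2)).
pi2 = [0.4, 0.1, 0.2, 0.3]            # (pi_11, pi_12, pi_21, pi_22)
eta2 = mlt(pi2, 2)

# d = 3: the last component is the contrast of conditional log odds ratios.
p = np.array([0.20, 0.05, 0.10, 0.05, 0.10, 0.10, 0.15, 0.25]).reshape(2, 2, 2)
eta3 = mlt(p.ravel(), 3)

def log_or(t):
    """Log odds ratio of a 2 x 2 table."""
    return np.log(t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0]))

contrast = log_or(p[:, :, 0]) - log_or(p[:, :, 1])
print(eta2, eta3[7], contrast)
```

With this construction L has 3^d rows and 2^d columns, one row for every marginal cell of every subset of the responses, which is why empty marginal cells limit the parameters that can be fitted.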


A key step in maximum likelihood estimation is the 
computation of the inverse of the multivariate logistic 
transformation. To ensure that $\pi > 0$, we work with 
$\pi = \exp v$, i.e., we seek to solve for $v$ in the equation 
$\eta = C^T \log(L \exp v)$. In general, no explicit solution is 
available, so an iterative method must be used. In particular, 
the Newton-Raphson iterations can be applied as described 
below. For clarity, we define the two functions 
$\varphi(\pi) = C^T \log(L\pi)$ and $\psi(v) = \varphi(\exp v)$. 

(i) Begin with an initial approximation $v_0$. 

(ii) Then take $v_{k+1} = v_k - [D\psi(v_k)]^{-1}(\varphi(\exp v_k) - \eta)$, 
where $D\psi(v)$ is the Jacobian matrix of the function 
$\psi(v)$, and iterate until convergence. 


The Jacobian matrices of the functions $\varphi(\pi)$ and $\psi(v)$ are 
given respectively by $D\varphi(\pi) = C^T (\operatorname{diag} L\pi)^{-1} L$ and 
$D\psi(v) = D\varphi(\exp v) \cdot \operatorname{diag}(\exp v)$. 
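
Steps (i) and (ii) can be sketched as follows. This is a minimal illustration rather than a production routine: the uniform starting value, the tolerance and the function names are assumptions made here, and no safeguards (such as step halving) are included for parameter values that do not correspond to a valid distribution.

```python
import numpy as np

def build_L_C(d):
    """Marginal indicator matrix L_d and contrast matrix C_d (L_0 = C_0 = 1)."""
    L = np.ones((1, 1))
    C = np.ones((1, 1))
    c = np.array([[1.0], [-1.0]])
    for _ in range(d):
        L = np.vstack([np.kron(L, np.ones((1, 2))), np.kron(L, np.eye(2))])
        Ck = np.kron(C, c)
        C = np.block([[C, np.zeros((C.shape[0], Ck.shape[1]))],
                      [np.zeros((Ck.shape[0], C.shape[1])), Ck]])
    return L, C

def inverse_transform(eta, L, C, tol=1e-10, max_iter=100):
    """Solve eta = C^T log(L exp v) for v by Newton-Raphson; return pi = exp v."""
    q = L.shape[1]
    v = np.full(q, -np.log(q))                # (i) start from the uniform table
    for _ in range(max_iter):
        pi = np.exp(v)
        m = L @ pi                            # all marginal probabilities
        resid = C.T @ np.log(m) - eta         # psi(v) - eta
        if np.max(np.abs(resid)) < tol:
            break
        D_phi = C.T @ (L / m[:, None])        # C^T diag(L pi)^{-1} L
        D_psi = D_phi * pi                    # D_phi(exp v) diag(exp v)
        v = v - np.linalg.solve(D_psi, resid) # (ii) Newton-Raphson update
    return np.exp(v)

# Round trip for d = 2: transform a table and recover it.
L, C = build_L_C(2)
pi = np.array([0.4, 0.1, 0.2, 0.3])
eta = C.T @ np.log(L @ pi)
pi_back = inverse_transform(eta, L, C)
print(pi_back)
```

Working on the log scale keeps every iterate strictly positive, so the logarithms in the transformation are always defined.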


3.2 Maximum Likelihood Estimation 


For a binary response variable observed at $d$ time points, 
there are $q = 2^d$ possible response profiles $i = (i_1, \ldots, i_d)$, 
where $i_1, i_2, \ldots, i_d$ are all either 1 or 2. For each profile 
$i = (i_1, \ldots, i_d)$, we define the indicator variable $Y_{i_1 \ldots i_d}$, which 
is equal to 1 if the profile $i$ has been observed, and 0 
otherwise. We then have 

$$P(Y_{i_1 \ldots i_d} = 1) = \pi_{i_1 \ldots i_d}.$$

Defining the $q$-dimensional vectors 

$$Y = (Y_{11\ldots11}, Y_{11\ldots12}, \ldots, Y_{22\ldots21}, Y_{22\ldots22})^T$$

and 

$$\pi = (\pi_{11\ldots11}, \pi_{11\ldots12}, \ldots, \pi_{22\ldots21}, \pi_{22\ldots22})^T,$$

we may then write $Y \sim M(1, \pi)$, i.e., $Y$ is a multinomial 
vector with $q = 2^d$ categories, whose probabilities are given 
by the vector $\pi$. 

The multivariate logistic regression models are then 
defined to be those of the form $\eta = X\beta$ where $X$ is a $q \times p$ 
matrix of explanatory variables, $\beta$ is a $p$-dimensional vector 
of unknown parameters, and $\eta = C^T \log(L\pi) = \varphi(\pi)$. 

If we let $y$ be one observation of the random vector $Y$, 
then we may write the kernel of the log likelihood function 
as $l(\beta; y) = y^T \log \pi(\beta)$ where, using the inverse of the 
multivariate logistic transformation, we can express the 
joint probabilities $\pi$ as a function of the unknown parameter 
$\beta$, as $\pi(\beta) = \varphi^{-1}(X\beta)$. The score vector is given by 

$$s(\beta) = s(\beta, y, X) = D\pi(\beta)^T (\operatorname{diag} \pi(\beta))^{-1} y,$$

where $D\pi(\beta)$, the Jacobian matrix of the function $\pi(\beta)$, 
relating the parameter $\beta$ to the vector of probabilities $\pi$, is 
given by $D\pi(\beta) = [D\varphi(\varphi^{-1}(X\beta))]^{-1} X$, and where 
$D\varphi(\pi) = C^T (\operatorname{diag} L\pi)^{-1} L$ is the Jacobian matrix of the link 
function. The information matrix is defined as 
$\mathcal{I}(\beta) = E\,s(\beta)s(\beta)^T$. Now it follows from the assumption on the 
distribution of $Y$ that $E(YY^T) = \operatorname{diag} \pi$, from which we 
may deduce that 

$$\mathcal{I}(\beta) = \mathcal{I}(\beta, X) = D\pi(\beta)^T (\operatorname{diag} \pi(\beta))^{-1} D\pi(\beta).$$

If we have $n$ independent observations $y_k \sim M(1, \pi_k)$, 
$k = 1, \ldots, n$, where $\eta_k = C^T \log(L\pi_k) = X_k\beta$, then the score 
vector and the information matrix are given by 
$s(\beta) = \sum_{k=1}^{n} s(\beta, y_k, X_k)$ and $\mathcal{I}(\beta) = \sum_{k=1}^{n} \mathcal{I}(\beta, X_k)$. 

The maximum likelihood estimator of $\beta$ is the solution 
of $s(\beta) = 0$, which can be found by using the Fisher scoring 
algorithm which, starting from some initial value $\beta_0$, 
iterates the sequence $\beta_{r+1} = \beta_r + \mathcal{I}^{-1}(\beta_r)\,s(\beta_r)$ until 
convergence. 
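
The Fisher scoring iteration can be sketched for the simplest case in which all units share the same design matrix, so that the data reduce to a vector of profile counts. The code below is a hedged illustration under that assumption: the function names, the zero starting value and the stopping rule are choices made here, and the example fits a saturated model for d = 2 (with the first row of X set to zero so that the first parameter, log of the total probability, stays at 0), for which the fitted probabilities should reproduce the observed proportions.

```python
import numpy as np

def build_L_C(d):
    """Marginal indicator matrix L_d and contrast matrix C_d (L_0 = C_0 = 1)."""
    L = np.ones((1, 1))
    C = np.ones((1, 1))
    c = np.array([[1.0], [-1.0]])
    for _ in range(d):
        L = np.vstack([np.kron(L, np.ones((1, 2))), np.kron(L, np.eye(2))])
        Ck = np.kron(C, c)
        C = np.block([[C, np.zeros((C.shape[0], Ck.shape[1]))],
                      [np.zeros((Ck.shape[0], C.shape[1])), Ck]])
    return L, C

def inverse_transform(eta, L, C, tol=1e-10, max_iter=100):
    """Newton-Raphson inversion of eta = C^T log(L pi), with pi = exp v."""
    v = np.full(L.shape[1], -np.log(L.shape[1]))
    for _ in range(max_iter):
        pi = np.exp(v)
        m = L @ pi
        resid = C.T @ np.log(m) - eta
        if np.max(np.abs(resid)) < tol:
            break
        v = v - np.linalg.solve((C.T @ (L / m[:, None])) * pi, resid)
    return np.exp(v)

def fisher_scoring(counts, X, L, C, tol=1e-8, max_iter=50):
    """Maximise sum_k y_k^T log pi(beta) when every unit has design matrix X."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = inverse_transform(X @ beta, L, C)
        D_phi = C.T @ (L / (L @ pi)[:, None])        # Jacobian of the link
        D_pi = np.linalg.solve(D_phi, X)             # [D_phi]^{-1} X
        score = D_pi.T @ (counts / pi)               # summed over the n units
        info = n * D_pi.T @ (D_pi / pi[:, None])     # n times the unit information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, inverse_transform(X @ beta, L, C)

L, C = build_L_C(2)
# Saturated model: eta_0 = 0, eta_1 = beta_1, eta_2 = beta_2, eta_12 = beta_12.
X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
counts = np.array([40, 10, 20, 30])    # profiles 11, 12, 21, 22
beta_hat, pi_hat = fisher_scoring(counts, X, L, C)
print(beta_hat, pi_hat)
```

With covariates, the score and information would instead be accumulated over groups of units that share a design matrix, as in the sums given above.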

Incomplete response profiles can readily be incorporated 
into the analysis. In particular, if some subset of $c$ of the 
response variables $Y_1, Y_2, \ldots, Y_d$ is recorded for a particular 
unit, then the probability distribution on that $c$-dimensional 
marginal table is multinomial, and, as a consequence of the 
reproducibility of the multivariate logistic transformation, 
a multivariate logistic regression model applies to the table 
of probabilities. Furthermore, the design matrix relating the 
marginal probabilities to B, is constructed by selecting the 
appropriate rows of the full design matrix, that would be 
used if complete data were available for that unit. 
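
Reproducibility, and the resulting row selection, can be verified numerically. The sketch below is illustrative only (the table of probabilities is arbitrary and the function names are invented here): it computes the parameters of a three-response table, then those of two of its two-dimensional margins, and checks that the latter are simply the corresponding components of the former.

```python
import numpy as np

def build_L_C(d):
    """Marginal indicator matrix L_d and contrast matrix C_d (L_0 = C_0 = 1)."""
    L = np.ones((1, 1))
    C = np.ones((1, 1))
    c = np.array([[1.0], [-1.0]])
    for _ in range(d):
        L = np.vstack([np.kron(L, np.ones((1, 2))), np.kron(L, np.eye(2))])
        Ck = np.kron(C, c)
        C = np.block([[C, np.zeros((C.shape[0], Ck.shape[1]))],
                      [np.zeros((Ck.shape[0], C.shape[1])), Ck]])
    return L, C

def mlt(pi, d):
    """Multivariate logistic transformation eta = C^T log(L pi)."""
    L, C = build_L_C(d)
    return C.T @ np.log(L @ pi)

# Arbitrary joint distribution of (Y1, Y2, Y3), indexed as p[i1, i2, i3].
p = np.array([0.20, 0.05, 0.10, 0.05, 0.10, 0.10, 0.15, 0.25]).reshape(2, 2, 2)

# Full parameter vector, ordered (0, 1, 2, 12, 3, 13, 23, 123).
eta_full = mlt(p.ravel(), 3)

# Units observed on (Y1, Y2) only, and on (Y1, Y3) only.
eta_12 = mlt(p.sum(axis=2).ravel(), 2)    # order (0, 1, 2, 12)
eta_13 = mlt(p.sum(axis=1).ravel(), 2)    # order (0, 1, 3, 13)

rows_12 = [0, 1, 2, 3]
rows_13 = [0, 1, 4, 5]
print(np.allclose(eta_12, eta_full[rows_12]), np.allclose(eta_13, eta_full[rows_13]))
```

The same index lists select the rows of the full design matrix X for a unit with an incomplete profile, for example `X[rows_13]` for a unit observed at the first and third times only.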


4. MODELS FOR LONGITUDINAL 
DEPENDENCE 


In this section we illustrate, using the SLFS data of 
Section 2, how multivariate logistic regression can be 
applied to describe the dependence between the repeated 
observations of the employment status. We do not intend to 
carry out an exhaustive search for a best model, but rather 
to demonstrate the ability of the method, to represent a 
complex dependence structure by a small number of 
parameters. 

We consider 6 models of decreasing complexity, see 
Table 2. For all 6 models, we have one parameter for each 
of the marginal logits corresponding to a given observation 
time. Symbolically, this is denoted by $\eta_i = \beta_i$. Since the 
observation times are the 2nd quarter of the years 1992 to 
1995, we take $i = 2, 3, 4, 5$. Thus $\beta_3$, say, corresponds to 
the logit of the probability of being employed in 1993. 
Similarly, the indices for the higher order parameters run 
from 2 to 5. For model 1 we take a saturated model for the 
longitudinal dependence, i.e., we have one parameter for 
each of the interactions of order 2, 3 or 4 within each period 
of observation. For the models 2 to 5, we assume that the 
interactions of order 3 and 4 are all equal to zero. The 
longitudinal dependence is then described in terms of log 
odds ratios only. For model 2, we take a saturated model for 
the log odds ratios. In model 3 we drop the covariate period 
of observation, i.e., we suppose that the log odds ratios are 
the same for all the periods of observation. In model 4, we 
use stationary log odds ratios, i.e., log odds ratios which 
depend only on the difference between times of observation. 
Note that the parameter $\gamma_1$ in model 4 corresponds 
to the constraint $\beta_{23} = \beta_{34} = \beta_{45}$ on the parameters of model 
3, and similarly for $\gamma_2$ and $\gamma_3$. In model 5, a linear model 
for the stationary log odds ratios is assumed. In model 6, 
finally, we assume that the observations taken at different 
times are independent. Note that, in that case, multivariate 
logistic regression is equivalent to ordinary logistic 
regression. 


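Table 2
Six Models for Longitudinal Dependence

                                    Parameters
Model  1st order           2nd order                                    3rd order                                      4th order
1      $\eta_i = \beta_i$  $\eta_{ij} = \beta_{ij,\mathrm{period}}$     $\eta_{ijk} = \beta_{ijk,\mathrm{period}}$     $\eta_{ijkl} = \beta_{ijkl,\mathrm{period}}$
2      $\eta_i = \beta_i$  $\eta_{ij} = \beta_{ij,\mathrm{period}}$     $\eta_{ijk} = 0$                               $\eta_{ijkl} = 0$
3      $\eta_i = \beta_i$  $\eta_{ij} = \beta_{ij}$                     $\eta_{ijk} = 0$                               $\eta_{ijkl} = 0$
4      $\eta_i = \beta_i$  $\eta_{ij} = \gamma_{|i-j|}$                 $\eta_{ijk} = 0$                               $\eta_{ijkl} = 0$
5      $\eta_i = \beta_i$  $\eta_{ij} = \delta + \gamma\,|i-j|$         $\eta_{ijk} = 0$                               $\eta_{ijkl} = 0$
6      $\eta_i = \beta_i$  $\eta_{ij} = 0$                              $\eta_{ijk} = 0$                               $\eta_{ijkl} = 0$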

The parameter estimates for models 2 to 6 are given
in Table 3. The number of parameters and the values of the
log likelihood function at the maximum likelihood estimates
can be found in Table 4 where, for comparison, we
also include the log likelihood for the fully saturated
model.

Overall, we notice that the assumed form of the
longitudinal dependence appears to have little effect on the
estimates of the marginal logits. This is a desirable
property, as the marginal logits would typically be the
parameters of interest. The standard errors of the marginal
logits are almost the same for the models that take into
account the longitudinal dependence, but are inflated by
about 15% for ordinary logistic regression (model 6). It can
also be shown that the estimates of the marginal logits are
positively correlated under the models that assume a
longitudinal dependence, and uncorrelated for ordinary
logistic regression. For the example considered here, the
correlation was found to lie between 0.4 and 0.8. Thus,
modelling the longitudinal dependence also leads to more
efficient estimates of the difference of marginal logits.
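The effect of a positive correlation on the precision of a difference follows from Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b). The sketch below illustrates this with the model 2 standard errors of the 1992 and 1993 marginal logits from Table 3; the correlation values are simply the endpoints of the range quoted above plus zero, and the function name is hypothetical.

```python
import math

def se_difference(se_a, se_b, rho):
    """Standard error of (a - b), given both standard errors and their correlation."""
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2.0 * rho * se_a * se_b)

# Standard errors of the 1992 and 1993 marginal logits under model 2 (Table 3).
se_92, se_93 = 0.0350, 0.0335

for rho in (0.0, 0.4, 0.8):
    # rho = 0 is shown only for contrast; it is not the model 6 fit itself.
    print(rho, round(se_difference(se_92, se_93, rho), 4))
```

The larger the correlation between the two estimated logits, the smaller the standard error of their difference.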

It can be seen from the fit of model 1 that the interaction
parameters of order 3 and 4 are not significantly different
from 0. This suggests that the longitudinal dependence can
be described by the log odds ratios only. This hypothesis is
corroborated by the incremental deviance of model 2 with
respect to model 1, which is found to be 7.9 on 12 degrees
of freedom. Further, all the parameters of model 2 are
significantly different from 0, and an examination of the
standardised residuals for the fitted probabilities of the
response profiles does not reveal any anomaly. For
applications in official statistics, model 2 would be the
preferred model, since it is based on as few assumptions as
possible, while still allowing a substantial reduction in the
number of parameters, thus rendering less acute the danger
of sparse tables when longer periods of observation and
models with more covariates are considered.

Models 3, 4 and 5 show that it would nevertheless
be possible to greatly simplify the description of the
longitudinal dependence without losing too much information.
In going from model 2 to model 5, we observe that
the deviance from the fully saturated model does not
increase much (see Table 4). Further, an examination of the
residuals shows that models 3, 4 and 5 fit the data
almost as well as model 2. On the other hand, while model
2 requires 20 parameters to describe the longitudinal
dependence, model 5 needs only 2 parameters. This must be
contrasted with model 6, which assumes independence
between observations taken at different times: the log
likelihood is much smaller than for the fully saturated
model (see Table 4), and the fit to the data is poor.


Table 3
Parameter Estimates and Standard Errors

Parameter       Period    Model 2           Model 3           Model 4           Model 5           Model 6
logit           92        0.6348 (0.0350)   0.6360 (0.0352)   0.6348 (0.0352)   0.6347 (0.0352)   0.6471 (0.0409)
logit           93        0.5555 (0.0335)   0.5570 (0.0338)   0.5597 (0.0335)   0.5601 (0.0335)   0.5509 (0.0396)
logit           94        0.5440 (0.0324)   0.5407 (0.0325)   0.5402 (0.0326)   0.5397 (0.0325)   0.5377 (0.0374)
logit           95        0.4699 (0.0317)   0.4711 (0.0320)   0.4710 (0.0320)   0.4712 (0.0320)   0.4705 (0.0351)
$\beta_{23}$    (1)23     4.2563 (0.3311)   4.2579 (0.1465)
                (1)234    4.2003 (0.2894)
                (1)2345   4.0859 (0.2954)
                2345      4.4830 (0.2841)
$\beta_{34}$    (1)234    4.0894 (0.2794)   4.1111 (0.1310)
                (1)2345   3.9611 (0.2840)
                2345      4.0989 (0.2600)
                345       4.2490 (0.2468)
$\beta_{45}$    (1)2345   5.3992 (0.3854)   4.5561 (0.1389)
                2345      3.9779 (0.2544)
                345       4.7288 (0.2735)
                45        4.5069 (0.2600)
$\beta_{24}$    (1)234    3.7168 (0.2641)   3.8371 (0.1442)
                (1)2345   4.2560 (0.3059)
                2345      3.5330 (0.2370)
$\beta_{35}$    (1)2345   4.4000 (0.3098)   3.7913 (0.1334)
                2345      3.6493 (0.2396)
                345       3.6116 (0.2192)
$\beta_{25}$    (1)2345   4.3984 (0.3173)   3.5774 (0.1530)
                2345      3.2209 (0.2256)
$\gamma_1$                                                    4.3260 (0.0928)
$\gamma_2$                                                    3.8519 (0.1050)
$\gamma_3$                                                    3.5340 (0.1495)
$\delta$                                                                        4.7341 (0.1266)
$\gamma$                                                                        -0.4191 (0.0653)
Table 4
Number of Parameters and Value of the Log Likelihood
Function at the Maximum Likelihood Estimates

              Number of parameters of order
Model         1     2     3     4     Total   Log likelihood
Full model    20    20    10    2     52      -5340.7
1             4     20    10    2     36      -5345.4
2             4     20    0     0     24      -5349.4
3             4     6     0     0     10      -5365.2
4             4     3     0     0     7       -5368.9
5             4     2     0     0     6       -5369.5
6             4     0     0     0     4       [illegible]

5. COMPARISON WITH SIMPLE ESTIMATE 
OF CHANGE 


In this section we concentrate on the estimation of the
difference of the probabilities of being employed between
any two given years. We show that estimates based on
multivariate logistic regression are more efficient than
simple estimates defined as the difference of the proportions
of employed individuals.

The model considered here is model 2 of Section 4,
with sex as an additional explanatory variable. We have, for
each sex, one parameter for each of the marginal logits
corresponding to a given year. The longitudinal dependence
is accounted for by a saturated model for the log odds
ratios. The third and fourth order parameters are set to 0.
This model therefore has 8 parameters for the marginal
logits, and 40 parameters for the log odds ratios: 2 sexes
x 20 odds ratios within periods of observation (see Table 3).
By inverting the multivariate logistic transformation,
estimates of the probability of being employed, and of their
differences between any two given years, can also be
computed.

A simple estimator of change is given by the difference
of the proportions of employed individuals between any two
given years. Its variance, which takes into account the
overlap of the two samples, is given by

$$\frac{1}{n+r}\,\pi_{1\cdot}(1 - \pi_{1\cdot}) + \frac{1}{n+c}\,\pi_{\cdot 1}(1 - \pi_{\cdot 1}) - 2\,\frac{n}{(n+r)(n+c)}\,(\pi_{11} - \pi_{1\cdot}\pi_{\cdot 1}),$$

where $n$ is the number of cases for which observations are
available for both years, $r$ and $c$ are the number of cases for
which observations are available for only one year, $\pi_{11}$ is
the probability of being employed during both years, and
$\pi_{1\cdot}$ and $\pi_{\cdot 1}$ are the marginal probabilities of being
employed.
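As a rough illustration, the variance of the simple estimator can be coded directly from its three terms. The function and argument names below are hypothetical, and the sketch assumes that r counts the cases observed in the first year only and c those observed in the second year only.

```python
def var_simple_change(n, r, c, pi_1, pi_2, pi_11):
    """Variance of the difference of two proportions from partially overlapping samples.

    n      -- cases observed in both years
    r, c   -- cases observed in the first year only / the second year only (assumed)
    pi_1   -- marginal probability of being employed in the first year
    pi_2   -- marginal probability of being employed in the second year
    pi_11  -- probability of being employed in both years
    """
    n1, n2 = n + r, n + c
    # Covariance of the two proportions, arising from the n overlapping cases.
    cov = n * (pi_11 - pi_1 * pi_2) / (n1 * n2)
    return pi_1 * (1.0 - pi_1) / n1 + pi_2 * (1.0 - pi_2) / n2 - 2.0 * cov

# Hypothetical inputs: 800 overlapping cases, 200 observed in one year only.
print(var_simple_change(n=800, r=200, c=200, pi_1=0.65, pi_2=0.63, pi_11=0.58))
```

With no overlap (n = 0) the covariance term vanishes and the expression reduces to the sum of two independent binomial variances; the larger the overlap and the stronger the year-to-year dependence, the smaller the variance.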

Table 5 shows, for the SLFS data of Section 2, the
estimates of the difference of the probability of being
employed under both methods. Note that both methods
yield similar estimates of change. The standard errors of the
simple estimates are, on average, 30% larger than for
multivariate logistic regression. The corresponding mean
relative efficiency of multivariate logistic regression with
respect to the simple estimates is 1.7. By comparison, the
mean relative efficiency of multivariate logistic regression
with respect to ordinary logistic regression is 3.2.


Table 5
Change in the Probability of Being Employed
Canton of Vaud, 1992-1995

         Comparison   Multivariate logistic regression   Simple estimate
Women    92 vs. 93    0.0138 (0.0090)                    0.0136 (0.0115)
         92 vs. 94    0.0184 (0.0102)                    0.0168 (0.0134)
         92 vs. 95    0.0375 (0.0109)                    0.0356 (0.0149)
         93 vs. 94    0.0047 (0.0087)                    0.0031 (0.0107)
         93 vs. 95    0.0238 (0.0095)                    0.0219 (0.0128)
         94 vs. 95    0.0191 (0.0076)                    0.0188 (0.0100)
Men      92 vs. 93    0.0220 (0.0095)                    0.0283 (0.0116)
         92 vs. 94    0.0245 (0.0102)                    0.0334 (0.0133)
         92 vs. 95    0.0387 (0.0106)                    0.0452 (0.0144)
         93 vs. 94    0.0024 (0.0092)                    0.0052 (0.0111)
         93 vs. 95    0.0167 (0.0098)                    0.0169 (0.0130)
         94 vs. 95    0.0143 (0.0080)                    0.0117 (0.0102)
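The summary figures quoted in the text can be checked against the standard errors in Table 5. The sketch below assumes that the relative efficiency of one estimator with respect to another is the ratio of their variances, i.e., the squared ratio of standard errors; the variable names are illustrative.

```python
# (multivariate logistic SE, simple-estimate SE) pairs, as read from Table 5.
standard_errors = [
    (0.0090, 0.0115), (0.0102, 0.0134), (0.0109, 0.0149),  # women
    (0.0087, 0.0107), (0.0095, 0.0128), (0.0076, 0.0100),
    (0.0095, 0.0116), (0.0102, 0.0133), (0.0106, 0.0144),  # men
    (0.0092, 0.0111), (0.0098, 0.0130), (0.0080, 0.0102),
]

# Ratio of standard errors, simple estimate over multivariate logistic regression.
se_ratios = [simple / mlr for mlr, simple in standard_errors]
mean_se_ratio = sum(se_ratios) / len(se_ratios)

# Relative efficiency taken as the ratio of variances (squared SE ratio).
mean_rel_eff = sum(ratio ** 2 for ratio in se_ratios) / len(se_ratios)

print(f"mean SE ratio {mean_se_ratio:.2f}, mean relative efficiency {mean_rel_eff:.2f}")
```

Under that definition the table gives a mean standard error ratio of about 1.3 and a mean relative efficiency of about 1.7, consistent with the figures reported above.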


6. CONCLUSIONS 


The analyses of the SLFS data presented here have
shown the usefulness of multivariate logistic regression.
Modelling the longitudinal dependence is necessary in
order to obtain a satisfactory fit of the observed response
profile probabilities. Ignoring the longitudinal dependence,
we still obtain acceptable point estimates of the marginal
logits, but the information on the detailed structure of the
data is lost. Modelling the longitudinal dependence also
leads to more efficient estimates of the marginal parameters
and of change, when compared with ordinary logistic
regression and with a simple estimator of change. Finally, the
ability of multivariate logistic regression to represent a
complex dependence structure by a small number of
parameters has also been illustrated.

Using the results of Glonek and McCullagh (1995), it is
possible to extend the examples presented here to
multivariate responses of either nominal or ordinal types,
with either discrete or continuous explanatory variables.
The method can also be extended to take the sampling
weights into account (Salamin 1998). For the SLFS, it was
found that the sampling weights have little effect on the
parameter estimates of the multivariate logistic model. The
standard error of the parameter estimates was inflated by
about 15%. This moderate increase of the variability of the
parameter estimates due to the sampling weights is plausible.
Indeed, since only one person per household is selected
in the SLFS, a large cluster effect was not expected.

Apart from the sort of analyses presented here, multivariate
logistic regression may also be used for modelling
non-response probabilities in longitudinal studies. Such
models may be useful when the sampling weights need to
be adjusted for non-response. The ability of multivariate
logistic regression to give a parsimonious model of the data
may also be of interest in small-area estimation. In particular,
estimators for a given geographical region could be
based on models for an appropriately chosen larger region.

Although we did not encounter serious problems in the
examples presented here, further work may need to be done
on the problem of sparse tables. A critical step, when there
are a large number of empty cells, is the inversion of the
multivariate logistic transformation. The approach of Lang
(1996), where the inversion of the link function is avoided
by specifying the models through constraints, may be of
interest in this context. Another area of investigation is the
influence of classification errors on the parameter
estimates of the multivariate logistic model.
Price Index Surveys as Quasi-Longitudinal Studies


ALAN H. DORFMAN¹


ABSTRACT 


To calculate price indexes, data on “the same item” (actually a collection of items narrowly defined) must be collected across
time periods. The question arises whether such “quasi-longitudinal” data can be modeled in such a way as to shed light on
what a price index is. Leading thinkers on price indexes have questioned the feasibility of using statistical modeling at all
for characterizing price indexes. This paper suggests a simple state space model of price data, yielding a consumer price
index that is given in terms of the parameters of the model.


KEY WORDS: Random walk plus noise model; State space model; Laspeyres index; Paasche index; Geometric price index. 


1. INTRODUCTION 


Survey sampling for calculation of a consumer price
index is characterized by following a given item across time
to determine its prices at a succession of times. Only it is
not, typically, exactly the same item that is followed: it is
not this particular can of Brand Y Tomato Soup at Outlet Z
whose price is repeatedly ascertained, for this
particular can is likely to have been sold and consumed by
the time of the next visit of the survey sampler. Rather, it is
a succession of items, each fitting the same description
(“Brand Y 8 oz. Can of Tomato Soup with Herring, sold at
Outlet Z”), whose price is collected at different times.
In other words, it is essentially a group of items fitting a
narrow description which is followed across time. For this
reason consumer price index surveys may be termed
“quasi-longitudinal”, as opposed to longitudinal surveys,
which would follow individual items across time.
Nonetheless, it is reasonable to hope that having repeated
measurements across time might lead to estimation
procedures which could capitalize on the time series aspect
of such surveys.

In the light of that hope, this paper considers a question
which has by and large been ignored by statisticians and
economists, or, when not ignored, been answered in the
negative: Can a consumer price index (CPI) be treated from
a statistical point of view? That is, can the parameter which
characterizes the “change in the cost of living” from one
period to another, and which price index surveys attempt to
estimate, be defined in terms of a stochastic model?

Aldrich (1992) gives a historical interpretation of early
attempts, by Jevons and especially Edgeworth, to
incorporate distributional assumptions into the CPI. Recent
papers on stochastic modeling of the CPI are those by Balk
(1980), Clements and Izan (1981, 1987), Bryan and Cecchetti
(1993), Kott (1984) and Selvanathan and Rao (1994).
Diewert (1995) reviews and criticizes these attempts, taking
an argument of Keynes (1930) as decisive grounds for
rejecting the stochastic approach.


In this paper, a specific approach to modeling the price
index using state space models is suggested, and a specific
state space model is tentatively proposed. This model is
applied to scanner data to demonstrate the feasibility of an
index based on it. The approach we contemplate circumvents
the Keynesian criticism in fundamental ways, and
offers the prospect of the many advantages that sound
statistical modeling can bring, including, possibly,
simplifications of the survey sampling process.

In what follows, we first briefly review the definition of
a price index, and the two (non-stochastic) approaches
which have dominated consideration of choice of index
(Section 2). We review the Bryan and Cecchetti (1993)
example of a statistical model for the price index, and
Diewert’s formulation of Keynes’ objection (Section 3).
We then introduce an approach to modeling a consumer
price index that circumvents the Keynes-Diewert
difficulties, and that leads naturally to the use of state space
models (Section 4). We present results of applying a relatively
simple random walk plus noise model to scanner data
from the A.C. Nielsen Academic Data Base (Section 5).
We assess the new index in Section 6, mentioning further
research that might be useful.


2. BACKGROUND 

What is meant by a Consumer Price Index (CPI) is a
single number indicating how the purchasing power of the
consumer has changed from one period $t'$ to another $t$. Its
raw ingredients consist of prices for the variety of available
items at (at least) the two time periods

$$p_s = (p_{s1}, \ldots, p_{sN}), \quad s = t', t,$$

as well as quantities of the items sold

$$q_s = (q_{s1}, \ldots, q_{sN}), \quad s = t', t.$$

(Often, however, in practice quantity data from the periods
in question are unavailable, and one makes do with some
form of surrogate.) The CPI is derived from a “formula”
that uses these raw ingredients:

$$I_{t't} = f(p_{t'}, p_t, q_{t'}, q_t),$$

where $f(\cdot)$ is a function of one of many possible forms.
Most such forms have a long history, and have been
extensively discussed in the index literature.

As examples, we mention here the Laspeyres index

$$L_{t't} = \frac{\sum_{i=1}^N q_{t'i}\, p_{ti}}{\sum_{i=1}^N q_{t'i}\, p_{t'i}} = \sum_{i=1}^N f_{t'i}\, r_{t'ti},$$

with $f_{t'i} = q_{t'i} p_{t'i} / \sum_{j=1}^N q_{t'j} p_{t'j}$ the “relative expenditures”,
and $r_{t'ti} = p_{ti}/p_{t'i}$ the “price relatives”. The Laspeyres index
uses the quantities from the earlier time period as a fixed
basis of comparison of the earlier and later prices. The
Laspeyres index (or a close variant) has tended to be the
index most targeted by governments, because of its
simplicity and intelligibility to the layperson.

The natural counterpart to the Laspeyres is the Paasche
index

$$P_{t't} = \frac{\sum_{i=1}^N q_{ti}\, p_{ti}}{\sum_{i=1}^N q_{ti}\, p_{t'i}},$$

which standardizes the prices by the later period quantities.
Most indices following other formulas will tend to fall
between the Paasche and Laspeyres.
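A minimal sketch of the two indices may help fix the notation. It uses the standard textbook definitions (a ratio of expenditures at fixed quantities, and equivalently for the Laspeyres an expenditure-share-weighted mean of price relatives); the function names and the three-item data set are hypothetical.

```python
def laspeyres(p0, p1, q0):
    """Later prices relative to earlier prices, valued at the earlier quantities."""
    return sum(q * p for q, p in zip(q0, p1)) / sum(q * p for q, p in zip(q0, p0))

def paasche(p0, p1, q1):
    """Later prices relative to earlier prices, valued at the later quantities."""
    return sum(q * p for q, p in zip(q1, p1)) / sum(q * p for q, p in zip(q1, p0))

def laspeyres_from_shares(p0, p1, q0):
    """The same Laspeyres index, written as a share-weighted mean of price relatives."""
    total = sum(q * p for q, p in zip(q0, p0))
    shares = [q * p / total for q, p in zip(q0, p0)]      # relative expenditures
    relatives = [later / earlier for earlier, later in zip(p0, p1)]  # price relatives
    return sum(f * r for f, r in zip(shares, relatives))

# Hypothetical prices and quantities for three items in two periods.
p0, p1 = [1.00, 2.00, 4.00], [1.20, 2.10, 3.60]
q0, q1 = [10, 5, 2], [8, 5, 3]

print(laspeyres(p0, p1, q0), paasche(p0, p1, q1))
```

In this toy example buyers shift towards the item whose price fell, so the Paasche index comes out below the Laspeyres index.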

For later reference in this paper, we mention an index
based on the geometric mean, with fixed non-negative
weights $f_i$ adding to 1:

$$G_{t't} = \prod_{i=1}^N \left(\frac{p_{ti}}{p_{t'i}}\right)^{f_i}.$$

This is sometimes referred to as the “Geomean”.

Fisher (1922) discusses these and many other index
formulae. He introduces what has come to be called the
“Test Approach” for deciding among the variety of
candidates for the formula $f(\cdot)$: this approach lays out
properties (“tests”) which a reasonable index would seem
to require, and then examines to what extent each index
formula satisfies them.

One of the tests is the Time Reversal Test: $I_{t't}\, I_{tt'} = 1$.
Two indices which continue to exercise their sway in the
world, but fail this test, are the Carli-Sauerbeck index
$C_{t't} = N^{-1} \sum_{i=1}^N p_{ti}/p_{t'i}$ and a geomean $\tilde{G}_{t't} = \prod_{i=1}^N (p_{ti}/p_{t'i})^{f_{t'i}}$
which employs first period expenditures instead of fixed
weights. One readily shows that $C_{t't}\, C_{tt'} \geq 1$, using the
Cauchy-Schwarz inequality, suggesting that this index will
run too high.

If an increase in prices on item $i$ tends to give an increase
in expenditure share, then $\tilde{G}_{t't}\, \tilde{G}_{tt'} < 1$, so that under such
conditions, the first-period geomean $\tilde{G}_{t't}$ tends to run too low.
If an increase in prices on item $i$ tends to give a decrease in
expenditure share, then $\tilde{G}_{t't}$ runs too high. In general, we
can expect this to be a rather erratic index.

This suggests the following maxim: price indices of the
form of a geometric mean should not have weights tied to
prices at one of the periods being compared; those of the
form of an arithmetic mean should not have weights
independent of those prices.

By contrast with the first-period geomean $\tilde{G}_{t't}$, the geomean $G_{t't} = \prod_{i=1}^N (p_{ti}/p_{t'i})^{f_i}$,
which has fixed weights, is the unique index which satisfies
the five axioms on price indices in Balk (1995) and the
“circularity test”, which says that, for $t' < t^* < t$,
$I_{t't} = I_{t't^*}\, I_{t^*t}$. Time reversal is an immediate consequence.
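The time reversal and circularity tests are easy to check numerically. The sketch below compares the Carli-Sauerbeck index with a fixed-weight geomean on hypothetical prices at three dates; the function names, the price vectors and the weights are all illustrative.

```python
import math

def carli(p_from, p_to):
    """Unweighted arithmetic mean of the price relatives."""
    return sum(later / earlier for earlier, later in zip(p_from, p_to)) / len(p_from)

def geomean(p_from, p_to, weights):
    """Geometric mean of the price relatives with fixed weights adding to 1."""
    return math.exp(sum(f * math.log(later / earlier)
                        for earlier, later, f in zip(p_from, p_to, weights)))

# Hypothetical prices at three dates t' < t* < t, and fixed weights.
p_a = [1.00, 2.00, 4.00]
p_b = [1.10, 1.90, 4.40]
p_c = [1.20, 2.10, 3.60]
f = [0.5, 0.3, 0.2]

# Time reversal: forward index times backward index should equal 1.
carli_product = carli(p_a, p_c) * carli(p_c, p_a)
geo_product = geomean(p_a, p_c, f) * geomean(p_c, p_a, f)

# Circularity: chaining through the intermediate date should give the direct index.
geo_chain = geomean(p_a, p_b, f) * geomean(p_b, p_c, f)

print(carli_product, geo_product, geo_chain, geomean(p_a, p_c, f))
```

The Carli product exceeds 1 whenever the price relatives are not all equal, while the fixed-weight geomean returns exactly 1 under time reversal and reproduces the direct index when chained.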

Indices which pass most of the tests tend to be ones
incorporating quantity information from both periods; for
example, the Fisher index

$$F_{t't} = (L_{t't}\, P_{t't})^{1/2}$$

and the Törnqvist index

$$T_{t't} = \prod_{i=1}^N \left(\frac{p_{ti}}{p_{t'i}}\right)^{f_i},$$

with $f_i = (f_{t'i} + f_{ti})/2$. The Fisher and Törnqvist are
frequently practically indistinguishable. Further discussion
of the test approach may be found in Balk (1995), Diewert
(1987), and Eichhorn and Voeller (1976).
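To illustrate how close the Fisher and Törnqvist indices typically are, the following sketch computes both, together with the Laspeyres and Paasche indices, using the standard definitions. The function name and the small data set are hypothetical.

```python
import math

def fisher_and_tornqvist(p0, p1, q0, q1):
    """Return (Laspeyres, Paasche, Fisher, Tornqvist) for two periods."""
    v00 = sum(p * q for p, q in zip(p0, q0))  # earlier prices, earlier quantities
    v10 = sum(p * q for p, q in zip(p1, q0))  # later prices, earlier quantities
    v01 = sum(p * q for p, q in zip(p0, q1))  # earlier prices, later quantities
    v11 = sum(p * q for p, q in zip(p1, q1))  # later prices, later quantities
    lasp, paas = v10 / v00, v11 / v01
    fisher = math.sqrt(lasp * paas)
    # Expenditure shares in each period, averaged to give the Tornqvist weights.
    f0 = [p * q / v00 for p, q in zip(p0, q0)]
    f1 = [p * q / v11 for p, q in zip(p1, q1)]
    log_index = sum(0.5 * (a + b) * math.log(later / earlier)
                    for a, b, earlier, later in zip(f0, f1, p0, p1))
    return lasp, paas, fisher, math.exp(log_index)

# Hypothetical prices and quantities for three items in two periods.
p0, p1 = [1.00, 2.00, 4.00], [1.20, 2.10, 3.60]
q0, q1 = [10, 5, 2], [8, 5, 3]
lasp, paas, fisher, tornqvist = fisher_and_tornqvist(p0, p1, q0, q1)
print(lasp, paas, fisher, tornqvist)
```

For these data both indices fall between the Paasche and the Laspeyres and differ from each other only in the fourth decimal place.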

The second approach to assessing index formulas is the
“economic” approach. This defines a generic index of the
form

$$I_{t't} = \frac{C(p_t, U)}{C(p_{t'}, U)},$$

where $U = U(q_1, \ldots, q_N)$ is a well-defined “utility function”,
and $C(p_t, U)$ is the minimal cost, at prices $p_t$, of achieving
the standard of living, or “utility”, $U$. For a particular utility
function $U$, one inquires whether a particular formula can
be regarded as a good approximation to the corresponding
cost of living index. Like the test approach, this tends to
yield indexes incorporating quantity information from both
periods. See Diewert (1987) for further elaboration.
3. THE STOCHASTIC APPROACH


Aldrich (1992) gives the early history of attempts to
model price relatives or logarithms of price relatives, using
a common parameter that represents the overall rate of
growth in prices. A basic theme of his paper is that the
stochastic approach to price indices, while being an early
example of the application of statistics to economic
concerns, died a natural death. Diewert (1995) also discusses
these attempts, as well as more recent examples of the statistical
modeling of price relatives. The difficulty which, following
Keynes (1930), Diewert finds with such use of models is
exemplified by a model of Clements and Izan (1987).

The period from $t'$ to $t$ is divided into equi-temporal
pieces, giving relatively short intervals generically
represented as being from $t - 1$ to $t$. The logarithm of the
price relatives for such a “micro-period” is given by

$$\log\left(\frac{p_{ti}}{p_{t-1,i}}\right) = \pi_t + \beta_i + \varepsilon_{ti}, \tag{1}$$

with $\varepsilon_{ti} \sim (0, \sigma^2/f_i)$. In their model, the $f_i$’s are the average
expenditure share of the $i$-th item over the period $t'$ to $t$.
For the sake of identifiability, it is assumed that
$\sum_{i=1}^N f_i \beta_i = 0$. These assumptions lead to a maximum
likelihood estimator

$$\hat{\pi}_t = \sum_{i=1}^N f_i \log\left(\frac{p_{ti}}{p_{t-1,i}}\right),$$

giving an MLE of the short period price trend as

$$\exp(\hat{\pi}_t) = \prod_{i=1}^N \left(\frac{p_{ti}}{p_{t-1,i}}\right)^{f_i};$$

that is, based on their stochastic model, one derives a
geometric index, with weights $f_i$ akin to that for the
Törnqvist.

Estimates of the $\beta_i$ and of $\sigma^2$ can also be derived, as well
as estimates of precision, for example, of the variance of $\hat{\pi}_t$.
Thus, a new statistical foundation seems to be put under an
old estimator.
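The link between the estimator and the geometric index can be verified directly: with weights adding to 1 and the identifiability constraint, the estimate of the common trend is the weighted mean of the log price relatives, and its exponential is a weighted geometric mean of the relatives. The sketch below uses hypothetical prices and shares; the function name is illustrative.

```python
import math

def price_trend_mle(p_prev, p_curr, f):
    """Weighted mean of log price relatives: sum_i f_i * log(p_ti / p_{t-1,i})."""
    return sum(fi * math.log(curr / prev) for prev, curr, fi in zip(p_prev, p_curr, f))

# Hypothetical prices in two consecutive micro-periods, and expenditure shares.
p_prev = [1.00, 2.00, 4.00]
p_curr = [1.20, 2.10, 3.60]
f = [0.5, 0.3, 0.2]

pi_hat = price_trend_mle(p_prev, p_curr, f)

# The same quantity computed as a weighted geometric mean of the price relatives.
geometric_index = math.prod((curr / prev) ** fi
                            for prev, curr, fi in zip(p_prev, p_curr, f))

print(math.exp(pi_hat), geometric_index)
```

The two printed values agree up to floating point error, which is the sense in which the stochastic model puts a statistical foundation under the geometric index.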

Diewert (1995) raises several objections, none of which 
can be taken lightly. The chief of these is 


“... the fundamental objection of Keynes
(Keynes 1930, p. 78): ‘The hypothetical 
change in the price level [$\exp(\pi_t)$] which
should have occurred if there had been no 
changes in relative prices, is no longer 
relevant if relative prices have in fact 
changed — for the change in relative prices 
has in itself affected the price level’.” 


If, say, the price of bread relative to the price of 
automobiles changes, then by that very fact, the overall 
price level changes. 
Keynes’ objection is not entirely clear. Why can’t there
be two aspects of price change, one overall, and the other
particular? However, it is not hard to agree that the individual
price trends are primary; an overall price trend can
only be some weighted sum of these. Diewert offers the
following translation of Keynes’ objection into the terms
of a model. Since the overall price trend must be of the
form

$$\pi_t^* = \sum_{i=1}^N f_i\, \beta_{ti}^*,$$

the model (1) needs to be replaced by

$$\log\left(\frac{p_{ti}}{p_{t-1,i}}\right) = \pi_t^* + \beta_{ti} + \varepsilon_{ti}, \tag{2}$$

with $\beta_{ti} = \beta_{ti}^* - \pi_t^*$ and $\sum_{i=1}^N f_i \beta_{ti} = 0$. The crucial difference
between this and (1) is that now the item parameters $\beta_{ti}$ are
indexed by time. But “then the resulting model has too
many parameters to be identified.” This would suffice to
nullify the approach.

Diewert (1995) does not discuss the much more 
complicated time-series model of Bryan and Cecchetti 
(1993). Of preceding papers, it is probably the closest to 
our present paper, involving a complicated state space 
model and use of the Kalman Filter. Like the other papers 
Diewert reviews, it is subject to Keynes’ objection. 


4. PRICE INDEXES RECONSIDERED 


4.1 Common Presuppositions 


The stochastic modeling of price behavior given in the
last section, whether embodied in equation (1) or (2) or
some similar model, has three notable characteristics; the
modeling is:

1. Comprehensive in the sense that it aims straight for an
overall “inflation rate” encompassing all items.

2. Atomistic: every item is modelled individually, having
its “private” parameter, its own rate of inflation
[$\exp(\pi_t + \beta_i)$], apart from all other items.

3. Time isolated: price relatives modeling for period $t - 1$
to $t$ is disjoint from that for period $t - 2$ to $t - 1$, etc.


It is the combination of these suppositions that yields
Diewert’s “over-parameterized” argument. The primary
thrust of Keynes’ criticism is against item 1: an overall inflation
rate or rise/fall in the cost of living has to be a weighted
mixture of several price trends. This may be granted
without going so far as to embrace item 2. Item 2 is tacitly
accepted in almost all (non-stochastic) constructions of
price indices. However, it is not at all clear that every
single item has its unique price trend. Different items (for
example, Brand X ice cream at several supermarkets) are
likely to have a tendency to rise and fall together (at least in
the long run). There are degrees of homogeneity between
items. In any case, none of these assumptions is a necessary
component of a stochastic view of price indices.


4.2 An Elementary State Space Model 


We divide the time period $t'$ to $t$ into sub-periods $t'$,
$t' + 1, \ldots, t - 1, t$, and the collection of heterogeneous items
into homogeneous sub-groups $g$, where the defining
characteristic of homogeneity is a tendency to similarity of
price change behavior. We make two assumptions:

1. $I_{t't}$ is a mixture of “homogeneous” indices $I_{g,t't}$.

2. $I_{g,t't}$ can be attained through chaining: $I_{g,t't} = \prod_{\tau} I_{g,\tau-1,\tau}$,
where $\tau = t' + 1, \ldots, t$.

We focus on a single group index $I_{g,t't}$, dropping the
subscript $g$ for simplicity of notation. Thus, for the
remainder of this paper, we focus on the “sub-index”
$I_{t't}$.

We proceed to develop an elementary state space model
(Harvey 1990, Chapter 3) for the logarithms of the
within-group price relatives. Suppose the group contains $n$
items. For $i = 1, \ldots, n$, let $r_{ti} = p_{ti}/p_{t-1,i}$ be the micro-period
price relatives, and $y_{ti} = \log(p_{ti}/p_{t-1,i}) = \log(p_{ti}) - \log(p_{t-1,i})$
their logs. The reason for using logs is that
considerable empirical work, beginning with Edgeworth
(see Diewert (1995)), suggests that the logs of price
relatives will be much closer to having a normal distribution
than the price relatives themselves, which can be
considerably skewed. Normal distribution of errors is a
standard assumption in state space models. Let
$y_t = (y_{t1}, \ldots, y_{tn})'$ and let $\mathbf{1}$ be a vector of ones of length $n$.

Consider the multivariate random walk plus noise
(RWPN) model

$$y_t = \mathbf{1}\mu_t + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{MVN}(0, \Sigma_\varepsilon),$$
$$\mu_t = \mu_{t-1} + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \tag{3}$$

with $\varepsilon_t, \eta_t$, $t \in (t', t' + 1, \ldots, t - 1, t)$, mutually independent.
The model implies that the amount that overall group prices
are rising (or falling) in one micro-period tends to hover
around the amount they tended to rise (or fall) in the
previous micro-period. This is a matter of common
observation: if the price rise in one month tends to be high
(low), then in the next month it tends to be correspondingly
high (low). Since we are considering a homogeneous set of
items, it makes sense that their log price relatives have a
common mean. We leave for later work the question of
how to join sub-indices into an overall index.

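The model (3) implies the simpler univariate RWPN
model

$$\bar{y}_t = \mu_t + \bar{\varepsilon}_t, \quad \bar{\varepsilon}_t \sim N(0, \sigma^2_{\bar{\varepsilon}}),$$
$$\mu_t = \mu_{t-1} + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \tag{4}$$
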
with $\bar{y}_t = n^{-1}\mathbf{1}'y_t$, $\bar{\varepsilon}_t = n^{-1}\mathbf{1}'\varepsilon_t$ and $\sigma^2_{\bar{\varepsilon}} = n^{-2}\mathbf{1}'\Sigma_\varepsilon\mathbf{1}$.
Some information is thrown away in using (4); on the other
hand, the normality assumption is even more likely to hold.
For convenience, calculations in the study described in
Section 5 were based on the univariate model.

The Kalman Filter (Harvey 1990, Section 3.2) can be
used to give estimates $\hat{\mu}_t$ and $\hat{\sigma}^2_{\bar{\varepsilon}}, \hat{\sigma}^2_\eta$ of the state
parameters $\mu_t$ and the variances $\sigma^2_{\bar{\varepsilon}}, \sigma^2_\eta$, respectively.
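For readers unfamiliar with the filter, the recursion for a univariate random walk plus noise model is short enough to write out. The sketch below is a generic textbook filter, not the computation used in the study: it treats the two variances as known, whereas in practice they would have to be estimated, and the function name, the diffuse starting variance and the toy series are assumptions made for illustration.

```python
def local_level_filter(y, var_eps, var_eta, m0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.

    Returns the filtered means E(mu_t | y_1, ..., y_t) and their variances.
    The large p0 stands in for a diffuse prior on the initial state.
    """
    m, p = m0, p0
    means, variances = [], []
    for obs in y:
        p_pred = p + var_eta                  # prediction: the random walk adds var_eta
        gain = p_pred / (p_pred + var_eps)    # Kalman gain
        m = m + gain * (obs - m)              # update with the one-step prediction error
        p = (1.0 - gain) * p_pred             # updated state variance
        means.append(m)
        variances.append(p)
    return means, variances

# Hypothetical series of mean log price relatives for one homogeneous group.
y_bar = [0.010, 0.025, 0.018, -0.004, 0.012, 0.020, 0.015]
mu_hat, mu_var = local_level_filter(y_bar, var_eps=1e-3, var_eta=1e-4)
print(mu_hat[-1], mu_var[-1])
```

The ratio of the two variances governs how quickly the filtered level follows the data: a small state variance relative to the noise variance gives a smooth level, which is what makes an index built on it less sensitive to period-to-period noise.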

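Then we define $I_{t't} = E(G_{t't} \mid S_t)$, where
$G_{t't} = \prod_{i=1}^n (p_{ti}/p_{t'i})^{f_i}$ is a geomean dependent on fixed shares $f_i$,
and $S_t$ represents the totality of state parameters $\mu_t$ through
time $t$ and also the “hyperparameters” $\sigma^2_{\bar{\varepsilon}}, \sigma^2_\eta$. In other
words, we condition on what we take to be the underlying
process through time $t$. Then

$$I_{t't} = \exp\left(\mu_t + \mu_{t-1} + \ldots + \mu_{t'+1} + \tfrac{1}{2} v\right), \tag{5}$$

where $v = (t - t') \sum_i \sum_j \sigma_{\varepsilon,ij} f_i f_j$, with $\sigma_{\varepsilon,ij}$ the covariance of
$\varepsilon_{ti}$ and $\varepsilon_{tj}$, typically of lower order than the state parameters $\mu_t$.
The natural estimator of $I_{t't}$ is
$\hat{I}_{t't} = \exp(\hat{\mu}_t + \hat{\mu}_{t-1} + \ldots + \hat{\mu}_{t'+1})$; then

$$E(\hat{I}_{t't} \mid S_t) \approx \exp\left(\mu_t + \mu_{t-1} + \ldots + \mu_{t'+1} + \tfrac{1}{2} \tilde{v}\right), \tag{6}$$

where $\tilde{v}$, given in the Appendix, does not in general equal $v$,
but is frequently close, and in any case is of the same order
of magnitude. The difference $\Delta(v) = \tilde{v} - v$ can be
estimated by, say, $\hat{\Delta}(v)$, yielding a bias-corrected estimator
$\tilde{I}_{t't} = \hat{I}_{t't} \exp(-\tfrac{1}{2}\hat{\Delta}(v))$. Expressions for $v$ and $\tilde{v}$, and a
suggestion for a maximal $\hat{\Delta}(v)$, are given in the Appendix.
It may be noted that $\hat{\Delta}(v)$, and hence $\tilde{I}_{t't}$, depends on the
weights $f_i$, but that $\hat{I}_{t't}$ does not.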
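The half-variance term in expression (5) is the usual mean correction for a log-normal variable. A sketch of the argument, under the multivariate RWPN model and assuming fixed shares $f_i$ that add to 1, with $s$ used here as the micro-period index (a notational choice made only for this illustration):

```latex
\begin{align*}
\log G_{t't}
  &= \sum_{i=1}^{n} f_i \sum_{s=t'+1}^{t} y_{si}
   = \sum_{s=t'+1}^{t} \mu_s
     + \sum_{s=t'+1}^{t} \sum_{i=1}^{n} f_i\, \varepsilon_{si}, \\
\log G_{t't} \mid S_t
  &\sim N\!\left( \sum_{s=t'+1}^{t} \mu_s ,\; v \right),
  \qquad
  v = (t - t') \sum_{i=1}^{n} \sum_{j=1}^{n} \sigma_{\varepsilon,ij}\, f_i f_j, \\
E\left( G_{t't} \mid S_t \right)
  &= \exp\!\left( \sum_{s=t'+1}^{t} \mu_s + \tfrac{1}{2}\, v \right).
\end{align*}
```

The first line uses the fact that the log of the geomean is a share-weighted sum of log price relatives accumulated over the micro-periods; the second uses the independence of the noise terms across micro-periods, which is why the variance grows linearly in $t - t'$.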


5. EMPIRICAL STUDY 


To determine the feasibility of the calculation of price 
indices using the RWPN model and gain some idea of the 
behavior of the RWPN index, a small empirical study was 
made, using price and quantity data for Canned Tuna in the 
A.C. Nielsen Academic Data Base. Canned tuna has 
somewhat volatile price and quantity behavior, due to 
frequent sales, at sometimes very marked discount. 

The study covered the Northeast USA and the 104 weeks
of the years 1992-1993. The original data set was rather
large. To make the investigation manageable, weekly data
were combined into 4-week periods, giving a total of 26
periods over two years. Thus, for purposes of this study, the
data were cumulative quantities and quantity-weighted
average prices over 4-week periods.
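The aggregation step can be sketched as follows. The function name and the toy series are hypothetical, and the handling of periods with no sales (returning no price) is an assumption, since the text does not say how such cases were treated.

```python
def aggregate_periods(prices, quantities, weeks_per_period=4):
    """Collapse weekly price and quantity series into longer periods.

    Returns one (average_price, total_quantity) pair per period, where the
    average price is weighted by the quantities sold in each week.
    """
    periods = []
    for start in range(0, len(prices), weeks_per_period):
        p = prices[start:start + weeks_per_period]
        q = quantities[start:start + weeks_per_period]
        total_q = sum(q)
        # Assumption: a period with no sales has no defined average price.
        avg_p = sum(pi * qi for pi, qi in zip(p, q)) / total_q if total_q > 0 else None
        periods.append((avg_p, total_q))
    return periods

# Hypothetical weekly data for one item at one outlet (8 weeks -> 2 periods).
weekly_prices = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0]
weekly_quantities = [1, 1, 1, 1, 2, 0, 0, 2]
print(aggregate_periods(weekly_prices, weekly_quantities))
```

Applied to 104 weekly observations, the same function yields the 26 four-week periods used in the study.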

The homogeneous groups were defined by brand and
type, as follows: 3 brands, here labeled A, B and C, of
“premium” tuna in water, the same three brands of “light”
tuna in oil, and again the same three brands of “light” in
water, making 9 groups in all.
The study focused on 83 outlets which had positive
quantities over most of the 4-week periods for each of the
9 distinct groups.

The RWPN based index $\hat{I}_{t't}$ and the adjusted RWPN
based index $\tilde{I}_{t't}$ were calculated for four time intervals. In
each case, the final period was $t = 26$, and the early period was
taken successively as $t' = 3, 6, 10, 14$. For the purpose of
comparison, we also calculated the corresponding Laspeyres
and Paasche indices. These two standard indices also provide
a basis of indirect comparison to the Fisher and
Törnqvist, which will be about half way between them.

Figures 1 and 2, for premium and light tuna respectively,
give the values of the four indices for the four time
intervals, with points representing the state space indices and
lines indicating the Laspeyres and Paasche. The
adjusted RWPN $\tilde{I}_{t't}$ is invariably larger than the unadjusted
RWPN $\hat{I}_{t't}$. Note that, since it is the first period that we are
varying, where the path of indices is monotone up, this
would suggest a downward trend in the cost of the
particular tuna group (and vice versa).

We observe that the new indices are not out of line with
the traditional indices, frequently lying between the
Laspeyres and Paasche, but they tend to be considerably
more stable as $t'$ changes, suggesting possibly that the
traditional indices are reacting to “noise” in the data, and
that, in fact, basically very little change is going on in this
two year period. It may also be observed in Figure 2 that
Light in Oil and Light in Water have similar within-brand
behavior, suggesting that we might have taken a broader
“homogeneous” grouping.


6. FURTHER WORK 


The investigation described in this paper suggests several 
topics for further research. 

Measures of precision and estimates of the RWPN 
indices, in terms of variances or confidence intervals based 
on the state space model, need to be worked out. Even 
those who are dubious about the viability of a stochastic 
methodology in price indices, find the possibility of having 
a measure of precision appealing (Diewert 1995). It would 
also be of interest to get measures of precision of more 
standard indices, based on the state space model. 

Empirical work is desirable that investigates more 
closely what groups of items might best be considered 
“homogeneous”. Also, models possibly more elaborate 
than the simple RWPN model require investigation. In this 
respect, the use of scanner data will be a great help, 
supplying as it does, quantity data as well as prices, in great 
detail. 


Figure 1. Four Price Indexes for Four Time Intervals, Premium Tuna 
(panel titles include Brand A, Premium in Water) 

Figure 2. Four Price Indexes for Four Time Intervals, Light Tuna 
(panel titles include Brand A, Light in Oil; Brand A, Light in Water; 
Brand C, Light in Oil; Brand C, Light in Water) 


The state space methodology has methods of handling 
missing data (Harvey 1990, Section 3.4.7). A point of 
major concern is how well these models will handle missing 
data in estimating price indices. In particular, since in 
practice most data for calculating price indices are based on 
a small sample of the items available, we need to know the 
robustness of state space indices to the absence of data. 

Algorithms for smoothing and forecasting of state space 
models are well known. Their use in revising and 
forecasting indices might be of great interest. 

Finally, in this paper we have focussed only on getting 
an index for a single homogeneous group. It would be of 
interest to develop a state space model that combines 
groups and enables us to get an overall measure of 
purchasing power. 


APPENDIX 

Details of expressions (5) and (6). 

f 
=I (a Ei Ofat i 

ie = 
ie OES) > fi 108 (Pu! Pei) 


We have that 


a 4 fl | Pi; Peis i Prsayi 
(he i ieee ee 
Pet eos Pri 


and letting 


we have that 


usp hy I log(r, 7 aly cy) 
- Det Livi e epi + Miscier) 
where the moments are calculated conditional on the state 


S,, as in Section 4.3. Let v = var(H,,,). Then 

E(H,,) = E(Hy,|S,) = 

ee Oi eg Ps Bg) Mp Wate Pe 
and 

v = var(H,,,) = var(H,,| S,) = 
va y E ss, US Dy » Gye SjJjr> 
Cr tl 

where 0: is the covariance of e,, and E,,1 We note that 


v=(t- t! Ds os 6,,, in the special case that the errors €,, 
are independent and identically distributed at each time 
period. 

We now consider the estimator ihe =exp(f, + f,_,+ Ay, )- 
We: find, that 2p) Sexp (ast Mot ogee WY)» 
where 


v= val 15) = {3 fpr, |S.) + 
t'+1 


t'+1 


t 


Yr var (fi, | S,) = {32 


eel | 


24 x2 
Ye + Vv Pr-1 (Se 


with 
t v 
teak ee )) ui] Joi ky)) 
v=tt+l u=t+l 
and 
t v 
rs yy I] dl & k.,)s 
v=t'+] u=t' +1 
where 


k, Pv) Gatch of 1} 


and Pyjq-1> Py are the mean square errors of fi, given data 
up to t- 1, t respectively, and are estimated using the 
Kalman Filter. 

This result follows from the equations used in estimating 


My 


A, - ky, ORs k,) A, 


A 


Be ei) loka 


A 


KB,» 


Kay )A,. 


(cf. Harvey 1990, equation 3.2.8), by expressing each fi, in 
POUOUY fy Ves ray Vireas bly 
In comparing v and ¥, we find, empirically that 
t 


2 *2 ’ 
Dt Aaa he pee (albeit 


t'+1 


Ay) = Ke Vest ss dd “i 

We here consider the simple case where $\mathrm{var}(\varepsilon_{ti}) = \sigma_{\varepsilon\varepsilon}$ and 
$\mathrm{cov}(\varepsilon_{ti}, \varepsilon_{ti'}) = \rho\,\sigma_{\varepsilon\varepsilon}$, with $\rho \ge 0$, for $i' \ne i$, that is, where not 
only the variances but all covariances are equal and 
non-negative. It then can be shown that 


ai Sp DTD Dp e/a Sek 


where n is the number of items in the group. The lower 
bound is achieved in the case f, = 1/n, and the upper in the 
case p=0. In the first case, no bias adjustment is 
necessary; in the second, we would take A v) =vV-9, 


where 0 = (t- tn), f 655 ; and ¥ = Cone YW Py} 8e¢- 
These correspond respectively to ibs and 1 ie 


Treatment of Nonresponse in Cycle Two of the National Population 
Health Survey 


JEAN-LOUIS TAMBAY, IOANA SCHIOPU-KRATINA, JACQUELINE MAYDA, 
DIANA STUKEL and SYLVAIN NADON¹ 


ABSTRACT 


The National Population Health Survey (NPHS) is one of Statistics Canada’s three major longitudinal household surveys 
providing an extensive coverage of the Canadian population. A panel of approximately 17,000 people is being followed 
up every two years for up to twenty years. The survey data are used for longitudinal analyses, although an important 
objective is the production of cross-sectional estimates. In each cycle, panel respondents provide detailed health information 
(H) while, to augment the cross-sectional sample, general socio-demographic and health information (G) is collected from 
all members of their households. This particular collection strategy presents several observable response patterns for Panel 
Members after two cycles: GH-GH, GH-G*, GH-**, G*-GH, G*-G* and G*-**, where “*” denotes a missing portion of data. 
The article presents the methodology developed to deal with these types of longitudinal nonresponse as well as with 
nonresponse from a cross-sectional perspective. The use of weight adjustments for nonresponse and the creation of 
adjustment cells for weighting using a CHAID algorithm are discussed. 


KEY WORDS: Longitudinal surveys; Treatment of nonresponse; CHAID algorithms. 


1. INTRODUCTION 


In 1996-97, Statistics Canada completed data collection 
for Cycle 2 of the National Population Health Survey 
(NPHS). This longitudinal survey was launched in 1994 to 
provide comprehensive information on the health status of 
the Canadian population and on the determinants of health. 
The in-scope population covers residents of households and 
health institutions throughout the country. In the provinces 
the household questionnaire has two main components 
which are administered using computer-assisted interviewing. 
The General component collects socio-demographic 
and basic health information for each member of the 
household. The Health component obtains more detailed 
health information about the household member selected to 
participate in the longitudinal panel. 

Although the NPHS is a longitudinal survey, its objec- 
tives also include the production of periodic cross-sectional 
estimates (Catlin and Will 1992). The data collection 
methodology reflects both longitudinal and cross-sectional 
needs. Panel Members, chosen in Cycle 1, are followed up 
every two years for up to 20 years. Persons residing with 
the Panel Members at those times provide General compo- 
nent information for use in cross-sectional estimation. As 
the cross-sectional coverage of the sample deteriorates over 
time, the sample needs to be “topped-up” periodically. The 
first top-up is planned for Cycle 3, in 1998. 

This paper presents the methodology developed in Cycle 
2 to deal with nonresponse at the household and person 
levels (flagging will be used for item nonresponse). The 
methodology is based on reweighting respondents within 


sub-populations called weighting cells to account for 
nonresponse. Reweighting is a common approach for the 
treatment of unit nonresponse. The bias and variance of 
this approach have been considered by Thomsen (1973), Oh 
and Scheuren (1983), Kalton and Kasprzyk (1986) and 
Little (1986), among others. If weighting cells are defined 
such that nonresponse occurs almost completely at random 
within each cell then the bias due to nonresponse can 
become negligible. In a similar vein David, Little, Samuhel 
and Triest (1983) extended to nonresponse the theory 
developed by Rosenbaum and Rubin (1983) in the context 
of propensity score matching in observational studies. 
Their results imply that reweighting can adjust for 
nonresponse bias when the weighting cells are formed 
based on the propensity to respond. 

An overview of the NPHS sample design and outputs for 
the first two Cycles is given in Section 2. Section 3 
presents the nonresponse treatment strategies and their 
results. Concluding remarks are given in Section 4. Note 
that the methodology presented pertains to the provincial 
household samples; it does not cover the samples in the 
territories and in institutions. 


2. OVERVIEW OF THE NPHS DESIGN AND 
OUTPUTS 


2.1 Cycle 1 Sample Design 


The initial sample of households was selected in 1994 
using the sample selection vehicle built for the Canadian 
Labour Force Survey (LFS), and, in the province of Quebec, 
using dwellings that had participated in a health survey 
conducted by Santé Québec the previous year. In both cases 
the households or dwellings were selected at random within 
stratified samples of clusters selected using probability 
proportional to size. The clusters were organized into 
replicates and collection periods to capture seasonality and 
for variance estimation purposes. There were two 
“summer” collection periods (June and August) and two 
“winter” collection periods (November and March, 1995). 

Figure 1 illustrates the Panel selection mechanism 
applied outside the province of Quebec. Sample house- 
holds were randomly designated as “Adult” or “Children” 
households, and as eligible for screening or not, prior to 
collection. Screening increased the presence in the panel of 
inhabitants of larger households who would be under- 
represented with the selection of only one member per 
household, particularly children and youths. Households 
eligible for screening were rejected from the sample if they 
had no member aged under 25. Screening was not used in 
Quebec as information from the provincial health survey 
allowed the application of different sub-sampling rates by 
household type and size. 


Sample Unit Type           Household Characteristic     Panel selection restricted to: 

“Children” household,      No member under 25           N/A (hhld rejected) 
Eligible for Screening     No children, some            Any member 
(EFS)                        members under 25 
                           Children present             Child members 

“Children” household,      No children present          Any member 
not EFS                    Children present             Child members 

“Adult” hhld               All                          Members over 12 


Figure 1. Panel Selection Mechanism Outside Quebec 


The classification into “Adult” and “Children” house- 
holds was done for an operational reason: the Health 
questionnaire for children would not be available before 
the winter collection periods. In “Adult” households, which 
could be interviewed any time, children under 12 were not 
eligible for the panel. “Children” households, even those in 
“summer” clusters, were interviewed in a winter collection 
period. If children were present in those households then 
the panel selection was restricted to them. To diminish the 
seasonal distortions to the data collection workload and the 
panel representability brought about by these procedures, 
fewer households were classified as “Children” households 
in summer clusters, and, with one minor exception, 
screening was applied only to “Children” households. 

Provinces wishing to improve sub-provincial estimates 
could fund additional sample sizes. In three provinces this 
was done by augmenting the sample size in targeted 
regions. In British Columbia an additional sample of about 
800 households was selected in a local health unit using 
Random Digit Dialling (RDD). The expected total sample 
size in the provinces was approximately 23,000 households 
after screening. 

The above gives a general indication of the 1994 sample 
design which is sufficient for the needs of this paper. 
Readers wishing a more precise presentation of the 1994 
sample should see Tambay and Catlin (1995), or Statistics 
Canada (1995). 


2.2 Cycle 1 Weighting and Outputs 


The major output of the NPHS consists of person-level 
anonymized Public Use Microdata Files (PUMFs) of survey 
responses (internal versions of those files that include 
information suppressed for confidentiality reasons are also 
created). For 1994 a General PUMF (58,400 records) and 
a Health PUMF (17,600 records) were released containing 
the General and Health information collected from every 
household member and from the selected non-child Panel 
Members, respectively (Statistics Canada 1995). 

The sample weights attached to every record on the 
PUMF were calculated by applying a series of adjustments 
to a basic weight representing the household inverse 
sampling rates (ISR). The ISRs are calculated by multi- 
plying the weights of the original LFS or Santé Québec 
samples by the inverse of the sub-sampling rates applied by 
the NPHS. For the sake of brevity we will only describe the 
main adjustments used outside of Quebec. 

Adjustments to the weights for the General PUMF 
include: (1) a household nonresponse adjustment; (2) an 
adjustment for the rejective method; (3) an adjustment for 
person nonresponse [within responding households] and, 
finally; (4) a simple post-stratification adjustment. 
Adjustment (2) was applied only to households with no 
member under 25. It was 1/(1 - r_s), where r_s was the 
sub-sampling rate for the screening applied in the stratum. The 
post-stratification adjustment was done separately for each 
province-age group-sex cross-class. Weights resulting from 
all earlier steps are multiplied by the ratio of known to 
estimated population sizes within the cross-class. The known 
population sizes are in fact Census-based projections. 
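
The post-stratification step described above can be sketched as a ratio adjustment within each cross-class. The following is a minimal Python illustration; the function name, the record layout and the cell label and totals are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def poststratify(records, known_totals):
    """Scale weights so that each cell's weighted count matches its known total.

    records: list of dicts with a 'cell' label (for example a
             province-age-sex cross-class) and a 'weight' from earlier steps.
    known_totals: dict mapping each cell label to its known population size.
    """
    estimated = defaultdict(float)
    for rec in records:
        estimated[rec["cell"]] += rec["weight"]
    # Multiply every weight by (known total) / (estimated total) of its cell.
    return [
        dict(rec, weight=rec["weight"] * known_totals[rec["cell"]] / estimated[rec["cell"]])
        for rec in records
    ]

# Hypothetical cell: weights sum to 250 but the projection says 300.
records = [
    {"cell": "ON-F-25to44", "weight": 100.0},
    {"cell": "ON-F-25to44", "weight": 150.0},
]
adjusted = poststratify(records, {"ON-F-25to44": 300.0})
```

After the adjustment the weights in the cell sum to the known total, which is the defining property of the step.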

The adjustments for household and person level 
nonresponse (at 11.3% and 1.4%, respectively) were 
applied to respondent units as the nonrespondents were 
excluded from the PUMFs. If $w_i$ is the sample weight of a 
unit i, the nonresponse-adjusted weight, $w_i^{adj}$, is defined as 
$w_i^{adj} = w_i \, (\sum_{j \in \text{sample}} w_j) / (\sum_{j \in \text{resp}} w_j)$, where the sums are taken 
over all sample units and all respondent units, respectively, 
within nonresponse adjustment weighting cells. Due to a 
lack of information on nonrespondent households the 
weighting cells for household level nonresponse were 
simply cross-classes of NPHS strata and season (i.e., 
“summer” vs. “winter” clusters). Weighting cells for the 
person level nonresponse, which was very low, were the 
province-age-sex cross-classes that were used for the post- 
stratification adjustment. 
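
The weighting cell adjustment just defined can be sketched in a few lines. This is a minimal Python illustration under assumed names: the function, the field names and the toy cell are hypothetical, and only the ratio of summed weights comes from the text.

```python
from collections import defaultdict

def nonresponse_adjust(units):
    """Return respondents only, with weights inflated within each weighting cell.

    units: list of dicts with 'cell', 'weight' and a boolean 'responded'.
    Each respondent's weight is multiplied by
    (sum of weights of all sample units) / (sum of weights of respondents)
    computed inside the respondent's own cell.
    """
    total = defaultdict(float)
    resp = defaultdict(float)
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            resp[u["cell"]] += u["weight"]
    return [
        dict(u, weight=u["weight"] * total[u["cell"]] / resp[u["cell"]])
        for u in units
        if u["responded"]
    ]

# Hypothetical cell with one nonrespondent: the factor is 40 / 20 = 2.
units = [
    {"cell": "stratum1-summer", "weight": 10.0, "responded": True},
    {"cell": "stratum1-summer", "weight": 10.0, "responded": True},
    {"cell": "stratum1-summer", "weight": 20.0, "responded": False},
]
adjusted_units = nonresponse_adjust(units)
```

The respondents' adjusted weights add up to the weighted total of the whole cell, so the nonrespondents are represented by respondents from the same cell.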

Adjustments to the weights for the Health PUMF 
included: (1) a household nonresponse adjustment; (2) an 
adjustment for the rejective method; (3) an adjustment for 
the “Adult/Children” household sub-sampling; (4) an 
adjustment for the longitudinal Panel Member selection; (5) 
an adjustment for Panel Member nonresponse; and (6) a 
post-stratification adjustment. The first two adjustments 
were exactly those for the General PUMF. As the Health 
PUMF did not include Panel Members who were children, 
adjustment (3) compensated for those sample households 
where non-children were ineligible for panel membership. 
The adjustment thus applied only to households with 
children and was equal to 1/r, where r was the proportion 
of “Adult” households in the sample. Adjustment (4) was 
the inverse of the probability of having selected the Panel 
Member. The adjustments for Panel Member nonresponse 
(at 3.9%) and for post-stratification were similar to those for 
the General PUMF, and used the same province-age-sex 
cross-classes. Although child Panel Members were not 
included in the Health PUMF, for longitudinal purposes 
their sample weights were obtained as above using 
1/(1 - r) instead of 1/r in step (3). 


2.3 Cycle 2 Sample Design 


In Cycle 2 the focus of the survey was more on 
longitudinal estimation: no sample “top-up” was planned 
until the following cycle. The “Core” sample thus 
consisted of about 17,000 Panel Members and their current 
households. Panel Members were traced and administered 
the General and Health questionnaire components, while 
other members of their household were administered the 
General component only. No follow-up was done for 1994 
nonresponding households. In Alberta, Manitoba and 
Ontario large (non-Core) additional samples were obtained, 
using RDD, to allow the production of cross-sectional 
estimates at sub-provincial levels. In every RDD household 
one member aged over 12 was selected to complete the 
Health component. In Alberta and Manitoba, RDD 
households with children also had a child selected to 
complete the Health component. 

We note that, for cross-sectional purposes, the Core 
sample does not adequately cover arrivals in the population, 
such as newborns and recent immigrants. The population 
administered the General questionnaire consists of residents 
of households where at least one member was in-scope in 
Cycle 1; households made up entirely of recent immigrants 
(and their newborns) are thus missed. The population 
administered the Health questionnaire consists of persons 
who were in-scope in Cycle 1: recent immigrants and 
children under 2 years old are excluded from the Core 
target population (they are included in the RDD target 
population). For both the General and the Health 
questionnaires post-stratification is done using population 
figures that do not exclude the recent immigrants. The 
result is that the population of recent immigrants is 
implicitly represented by the population of 
non-immigrants because the latter’s Core weights are adjusted 
upwards to account for the former’s numbers. This is a 
limitation that is acknowledged in the PUMF documen- 
tation. Alternative methods would have been to post-stratify 
using only non-immigrant population projections or to 
somehow adjust only the weights of less recent immigrants 
(who are covered) to account for the more recent immigrants 
(who are not). These methods would have been difficult to 
apply where, for the General questionnaire, a distinction 
between immigrants in immigrant-only households and 
immigrants in mixed households would have been required. 


2.4 Cycle 2 Weighting and Outputs 


Figure 2 summarizes the survey’s three major outputs 
planned for Cycle 2: a Longitudinal PUMF; a Health Cross-Sectional 
PUMF; and a General Cross-Sectional PUMF. 
The planned Longitudinal PUMF contains General and 
Health information for both Cycles for the 17,000 Panel 
respondents [note: confidentiality requirements may prevent 
the release of a longitudinal PUMF — in which case only an 
internal microdata file will be produced]. The Health 
Cross-Sectional PUMF contains 1996 General and Health 
information for about 70,000 Panel Members and RDD 
Selected Members. The General Cross-Sectional PUMF 
contains 1996 General information for about 220,000 
members of the Core and RDD samples. The weighting 
processes involved for each PUMF, presented below for the 
Core sample, are described in more detail in Stukel, Mohl 
and Tambay (1997). 


Output File    LONGITUDINAL PUMF        HEALTH CROSS-SECTIONAL PUMF   GENERAL CROSS-SECTIONAL PUMF 

Contents       General & Health         General & Health              General only 
Samples        Core only                Core & RDD (3 provs.)         Core & RDD (3 provs.) 
Units          Panel Member (PM)        PM/RDD Sel. Mem.              All Hhld. Members 
Size           ≈ 17,000 records         ≈ 70,000 records              ≈ 220,000 records 

Weighting      1. Base Year Weight      1. Base Year Weight           1. Base Year Weight 
Strategy       2. PM Nonresp. Adj.      2. PM Nonresp. Adj.           2. Hhld. Nonresp. Adj. 
(for Core      3. Post-stratification   3. Core/RDD integration       3. Weight Share Adj. 
Sample)                                 4. Post-stratification        4. Hhld. Mem. NR Adj. 
                                                                      5. Core/RDD integration 
                                                                      6. Post-stratification 

Figure 2. Description of Output Files for Cycle 2 


Respondent survey weights on the Longitudinal PUMF 
are obtained by adjusting a base year weight first for 1996 
panel nonresponse and then for post-stratification. The 
base year weight represents the inverse sampling rate for 
1994 including all Health PUMF adjustments described in 
section 2.2 up to adjustment (4) for panel selection (a 
correction is needed for the “removal” of the 1994 
provincial sample additions). The weight adjustment for 
nonresponse is the focus of the following section and will 
be described there. Post-stratification is done to reproduce 
1994 provincial population counts by age-sex categories. 

For the Health Cross-Sectional PUMF, the weighting 
process for (Core) Panel Members involves three or four 
steps. Usually, the base year weight is adjusted for Panel 
Member nonresponse, as explained in the following section, 
and for post-stratification (to match 1996 provincial or 
regional population counts by age-sex categories). In prov- 
inces with RDD samples the extra step is the integration 
with the RDD sample. The integrated estimate is obtained 
by a somewhat degenerate adaptation of the Skinner-Rao 
dual frame estimator (Skinner and Rao 1996). 

For the General Cross-Sectional PUMF, the weighting 
process for the core sample involves five or six steps. First, 
once more, is the calculation of the base year weight. Then 
comes an adjustment for nonresponse at the household 
level, discussed in the following section. The next step is 
the application of the “weight share method”. The method 
was described by Ernst (1989) and developed further by 
Lavallée (1995). The Panel Member’s weight, divided by 
the number of persons in his/her household who were in- 
scope in Cycle 1, is assigned to all household members 
including those who were not in-scope in Cycle 1 (e.g., 
births, immigrants). The method is unbiased for estimates 
of totals for the population of households where at least one 
member was in-scope in Cycle 1. The next step is a house- 
hold member nonresponse adjustment. In RDD provinces 
this is followed by integration of the Core sample with the 
RDD sample (this time for all ages). Post-stratification is 
done in a similar fashion to that for the Health Cross- 
Sectional PUMF. 
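
The weight share step described above can be sketched as follows. This is a minimal Python illustration of the single-Panel-Member case described in the text; the function name, argument names and the toy household are hypothetical.

```python
def weight_share(panel_weight, in_scope_cycle1):
    """Share one Panel Member's weight among all current household members.

    panel_weight: the Panel Member's (nonresponse-adjusted) weight.
    in_scope_cycle1: one boolean per current household member, True if that
                     person was in-scope in Cycle 1.
    Every member, including births and immigrants, receives the Panel
    Member's weight divided by the number of members in-scope in Cycle 1.
    """
    n_in_scope = sum(1 for flag in in_scope_cycle1 if flag)
    if n_in_scope == 0:
        raise ValueError("household has no member who was in-scope in Cycle 1")
    shared = panel_weight / n_in_scope
    return [shared for _ in in_scope_cycle1]

# Hypothetical household: three members in-scope in Cycle 1 plus one newborn.
member_weights = weight_share(900.0, [True, True, True, False])
```

In this toy household each of the four members receives a weight of 300, so the newborn is represented even though it could not have been selected in Cycle 1.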


3. CYCLE 2 CORE SAMPLE NONRESPONSE 
STRATEGY 


This section presents the strategy adopted for the 
treatment of Cycle 2 nonresponse in the Core (non-RDD) 
sample. Adjusting for nonresponse was done once again 
using the weighting cell approach except that, this time, 
Cycle 1 data were available to create weighting cells that 
are more homogeneous with respect to the propensity to 
respond, and thus more apt to remove nonresponse bias. 
Section 3.1 identifies nonrespondents in the NPHS. 
Section 3.2 discusses two general approaches for the 
creation of weighting cells, giving the one chosen for the 
NPHS. The strategy for the adjustment for nonresponse is 
explained in section 3.3 while section 3.4 describes the 
creation of the nonresponse weighting cells. 


3.1 Definitions of Nonrespondent and Out-of-scope 
Units 


The first step in the treatment of nonresponse consisted 
of its definition or identification. In Cycle 2, questionnaires 
were fully completed for 89% of the Core sample and 
partially completed for another 3%. The rest of the sample 
consisted of refusals (3.1%), of cases where the Panel 
Member could not be traced (1.7%), had died (1.7%), had 
left Canada (0.5%), or was institutionalized (0.4%), and of 
other types of nonresponse such as temporary absences and 
special circumstances (0.7%). Within responding house- 
holds person level nonresponse was very low: 1.8% for the 
General questionnaire and 1.1% for the Health question- 
naire. We first identify cases that are not nonresponses for 
longitudinal and cross-sectional purposes. 

For longitudinal purposes a death is considered a valid 
survey response. Panel Members who had died before 
Cycle 2 had their status recorded as such and the 1996 
portion of their data coded as “Not Applicable” on the 
Longitudinal microdata file. Panel Members who moved to 
an institution or to the Territories were followed up and 
their responses were used for longitudinal purposes. Panel 
Members who left the country were not followed up but 
were treated as longitudinal nonrespondents even though it 
would have been more accurate for some analyses to have 
considered them as having left the scope of the study. This 
treatment was chosen because such persons would fall back 
in-scope should they move back to Canada. 

For cross-sectional purposes all the cases presented in the 
preceding paragraph were treated as out-of-scope situations. 
This was acceptable because the separate Institutional and 
Territorial survey vehicles assumed the cross-sectional 
coverage of these particular in-scope populations. Out-of- 
scope units were not on the PUMFs but, as they represented 
other such units, they were treated for weighting purposes 
like respondents in all the weight adjustment steps except 
the integration and post-stratification steps. 

Refusals and cases where questionnaires were missing for 
reasons other than those given in the preceding paragraphs 
were defined as nonresponses. As will be seen, a distinction 
was later made between “full” and “partial” longitudinal 
nonrespondents to accommodate different users. 


3.2 Approach for Creating Nonresponse 
Adjustment Weighting Cells 


Two statistical approaches for creating response weighting 
cells involve segmentation modelling and logistic regression. 
An example of the latter is given in Czajka, Hirabayashi, Little 
and Rubin (1992). The authors obtained advance taxation 
estimates from early tax filer returns using adjustment 
weighting cells that were based on ranges of propensities to 
file early. Logistic regression was used to estimate tax filers’ 
propensities to file early. The longitudinal Survey of Labour 
and Income Dynamics (SLID) provides another example 
involving logistic regression (Grondin 1996). Sample units’ 
response indicators were regressed on known (dichotomous) 
characteristics. Adjustment cells for nonresponse were 
generated by cross-classifying the sample units using all the 
significant covariates. In order to respect minimum cell sizes 
and response rates some collapsing was done starting with 
cells sharing all but the least significant covariates. 

In the segmentation modelling approach a decision tree 
structure is generated from the data by successively splitting 
the data set such that, at each node, the most significant 
predictor for the response variable is used to define the 
following split. The splitting continues until one cannot 
find any significant variable for the split or minimum cell 
size requirements cannot be respected. An early application 
of segmentation modelling for nonresponse adjustment was 
with respect to the Panel Study of Income Dynamics 
(Institute for Social Research 1979). Because of its advan- 
tages, given below, the NPHS adopted the segmentation 
modelling approach using the CHAID algorithm developed 
by Kass (1980). The CHAID (Chi-square Automatic Interaction 
Detection) algorithm uses χ² tests to define splits for 
categorical predictors and retains the most significant split 
at each stage. The splitting, into two or more categories, is 
done differently for ordered and unordered predictors. 
CHAID was applied using the Knowledge Seeker software 
program (ANGOSS Software 1995). Note that Knowledge 
Seeker applies CHAID to continuous predictors by first 
transforming them into ordered discrete variables. 
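
The choice of a split at one node can be sketched as follows. This is a deliberately simplified Python illustration, not the Knowledge Seeker implementation: it assumes binary (0/1) predictors only, so every test has one degree of freedom, and it omits the category merging and multiplicity adjustments of the full CHAID algorithm. The function names, the `alpha` and `min_cell` parameters and the toy data are hypothetical.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square statistic and p-value (1 df) for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    # Upper tail of a chi-square variable with one degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))

def best_split(units, predictors, alpha=0.05, min_cell=2):
    """Return the most significant binary predictor of response, or None.

    units: list of dicts holding 0/1 predictor values and a 0/1 'responded'.
    A predictor is skipped if either child cell would be smaller than min_cell;
    None means the node is not split further.
    """
    best = None
    for name in predictors:
        table = [[0, 0], [0, 0]]
        for u in units:
            table[u[name]][u["responded"]] += 1
        if min(sum(table[0]), sum(table[1])) < min_cell:
            continue
        _, p_value = chi2_2x2(table)
        if p_value <= alpha and (best is None or p_value < best[1]):
            best = (name, p_value)
    return best[0] if best else None

# Hypothetical node: response depends on 'renter' but not on 'urban'.
units = []
for renter, n_resp in ((0, 18), (1, 8)):
    for k in range(20):
        units.append({"renter": renter, "urban": k % 2,
                      "responded": 1 if k < n_resp else 0})
chosen = best_split(units, ["renter", "urban"])
```

Applying `best_split` recursively to the two child cells, until it returns `None`, would yield the terminal nodes that serve as weighting cells.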

Advantages and disadvantages of the logistic and 
CHAID approaches are known and documented (for 
example see Kalton and Kasprzyk 1986). The logistic 
regression approach is based on theory familiar to many 
analysts, and can be programmed using a number of 
standard statistical packages. It also provides individual 
estimates of response propensity that can be used directly to 
adjust the weights of respondents. However, to ensure 
reasonable program execution times the number of variables 
and interaction terms used must usually be limited. 
Collapsing cells can also become complicated, as in the 
case of SLID above. The CHAID algorithm offers the 
advantages of accepting a large number of covariates and, 
by its decision tree structure, easily accommodating 
interactions among them. Moreover, minimum cell size 
requirements can easily be incorporated as program execu- 
tion parameters. Its main disadvantages are a less familiar 
theoretical underpinning (Knowledge Seeker is advertised 
as a “data mining” tool) and the limited documentation and 
software available for its implementation. It should also be 
mentioned that, while some statistical packages such as 
SUDAAN and PC CARP can incorporate the sample 
design when fitting logistic models to survey data, this is 
not the case with CHAID. The NPHS tried to address this 
limitation by including as predictor variables characteristics 
that were related to the sample design (see Section 3.4). 

Two empirical studies comparing the logistic and 
CHAID approaches for the treatment of nonresponse 
obtained different results. Rizzo, Kalton and Brick (1996) 
did not find much of a difference between the two 
approaches for the Survey of Income and Program 
Participation. On the other hand Dufour, Gagnon, Morin, 
Renaud and Sarndal (1998), in a simulation study for SLID, 
obtained a lower bias after nonresponse adjustment with the 
CHAID approach. 


3.3 Adjusting for Nonresponse in the Core Sample 


Nonresponse adjustments had to be developed for each 
PUMF: Longitudinal, General (Cross-Sectional) and Health 
(Cross-Sectional). We will deal with the General PUMF 
first. 

As Figure 2 showed, the weighting strategy for the 
General PUMF required separate adjustments for nonresponse 
at the household and at the person levels. In 
creating adjustment cells for household level nonresponse, 
characteristics of the Panel Member as well as those of the 
household were considered as nonresponse predictors. This 
was done for three reasons. Firstly, as the Panel Member 
was the link to the household in Cycle 2, his or her 
characteristics may be related to finding the household and 
obtaining a response (the first contact will often be through 
him or her). Secondly, a few personal characteristics of the 
Panel Member, such as race, are in some sense household 
characteristics. Finally, using Panel Member characteristics 
was not incompatible with our need to use a variety of 
information for the construction of weighting cells. If Panel 
Member characteristics are not significant, then CHAID 
simply does not retain them. 

Person level nonresponse to the General component 
occurred when the information was available for some but 
not all of the household members, perhaps due to members’ 
refusals or temporary absences. Given the low 1.8% 
nonresponse rate at the person level, it was felt that the 
creation of weighting cells based on province-age-sex 
categories (as in Cycle 1) would be sufficient for our needs. 

In contrast to the General PUMF, the adjustments for 
household and person level nonresponse for both the 
Longitudinal and the Health PUMFs could be combined 
into a single adjustment as they concerned only one 
person — the Panel Member. A single set of adjustment 
cells thus needed to be created. 

For the Longitudinal PUMF it was noted that the data 
items came from both the General and Health components 
but that response rates for the two components were 
different. This difference produced data with different 
Cycle 1-Cycle 2 reporting patterns: GH-GH, GH-G*, 
G*-GH, G*-G*, not to mention longitudinal nonresponse 
patterns GH-** and G*-**, where the letters stand for the 
component reported in each Cycle (“*” if not reported). To
maximize the utility of the data it was decided to do two 
adjustments for longitudinal nonresponse. One adjustment 
would be for the “Full Longitudinal Response” pattern GH- 
GH. In other words, all other response patterns would be 
considered as nonresponses. The other adjustment would 
be for the “Partial Longitudinal Response” pattern which 
included cases where, at minimum, General information 
was available for each cycle (patterns GH-GH, GH-G*,
G*-GH and G*-G*). The Full Response data set could be 
used by researchers who would like to analyse a full longi- 
tudinal data set covering the entire questionnaire contents. 
The Partial Response data set could be of use to researchers 
primarily interested in the types of variables that are on the 
General questionnaire. As the counts in Table 1 below show,
the Partial Longitudinal Response data set is only about 3% 
larger than the Full Longitudinal Response data set. 


Table 1
Longitudinal Response Patterns

Response Type       Cycle 1-2            Number
Full    Partial     Response Pattern     of records

 ✓        ✓         GH-GH                15,670
          ✓         GH-G*                   110
          ✓         G*-GH                   366
          ✓         G*-G*                    22
                    GH-**                 1,014
                    G*-**                    94
                    Total                17,276


Based upon the above, adjustment cells must be built for 
five types of responses (or nonresponses) in Cycle 2: 


• General PUMF - household response
• General PUMF - person response
• Health PUMF - combined response
• Longitudinal PUMF - full response
• Longitudinal PUMF - partial response


Only three sets of adjustment cells were created for those 
response types. Adjustment cells created for the General 
PUMF household level responses were also used for the
Longitudinal PUMF partial responses because getting a 
response from a household led almost always to obtaining 
a partial response for the longitudinal member (there were 
53 exceptions). Likewise, adjustment cells generated for 
full respondents on the Longitudinal PUMF were used for 
the Health PUMF responses. Although there were 366 
more cases of responses of the latter type (pattern G*-GH) 
it was considered that the same response mechanism was at 
work in both instances. The third set of adjustment cells 
was for person level responses on the General PUMF. 
Province-age-sex categories were used, as was done in 
Cycle 1. 

Note that, although the same adjustment cells would 
serve for different data sets, the nonresponse weight 
adjustments would be calculated separately for each data set 
type. Thus, the 366 records with response pattern G*-GH 
would be treated as respondents when adjusting weights for 
the Health PUMF, but as nonrespondents when adjusting 
weights for full respondents on the Longitudinal PUMF. 


3.4 Creation of Weighting Adjustment Cells 


Separate sets of weighting adjustment cells were created 
for each province. The first step consisted of identifying 
the variables to consider. With CHAID the number of 
variables that could be considered was not really an issue, 
and different types were considered. Characteristics of the 
household, or dwelling, as well as personal characteristics 
of the Panel Member would of course be considered. In an 
effort to incorporate the design of the survey into the 
analysis, some characteristics that were related to the design
of the survey or to the sampling weight were also considered.
These included geographical variables such as the
Census Metropolitan Area code or the Urban/Rural 
indicator, special Cycle 1 design variables such as the flag 
identifying households for screening and the “Adult/ 
Children” household type, and characteristics related to the 
application of those design variables, such as the presence 
in the household of a member aged under 25 or of a child. 
The household size was used as it was a household charac- 
teristic and was also related to the sample weight. From 
experience, it was also decided to include, in addition to the 
household income characteristic, a dummy characteristic 
that identified if household income had been reported in 
Cycle 1 or not. As a change of address can lead to an 
unable-to-trace nonresponse situation we would have liked 
to use a change-of-address identifier. However, in some 
nonresponse and no contact situations it was difficult to 
ascertain whether the Panel Member had indeed moved. In 
the end a “Mover” variable, which identified whether the 
Panel Member had changed provinces between Cycles, was 
used in the analysis even though this was far from ideal 
because the default value would be “no”. Personal charac- 
teristics from the Health questionnaire component such as 
Smoker/Drinker status, Health Index Level and Mental 
Health/Distress Scale were not used because they were not 
available for almost 500 Panel Members. 

The variables used are listed below. The nonresponse 
indicator, which was the dependent variable, had its values 
assigned according to the definition of nonresponse being 
used. 


DESIGN/GEOGRAPHICAL VARIABLES 


PROVINCE The analysis was done at the provincial level 


CMA Census Metropolitan Area (0 if not a CMA) 

URBAN Urban/Rural Indicator 

REJECT Flag if the unit (household) was eligible for 
screening 

ACFLAG  “Adult/Children” design classification for 
the unit 


DWELLING/HOUSEHOLD CHARACTERISTICS 


DWELL Dwelling type (10 categories) 

OWNER Owner/Renter Indicator 

FAMTYP Family Type (unattached individual, single 
parent hhld., married couple hhld., other) 

INC Household Income Adequacy (5 levels) 

INCNR Nonresponse flag for INC 

INCSRC Main source of income (6 categories) 

*HHSIZE Household size 

UND25 Indicator of members under 25 years old 

KIDS Indicator of children under 12 years old 


PERSONAL CHARACTERISTICS OF PANEL MEMBER


SEX Sex 
AGE Age in years 
AGE16 Indicator if aged 16 or older 


MARIT Marital Status 
FAMID Family Identifier within household (A, B, C, ...) 
RACE White, Black, Aboriginal or Other 
BORN Place of birth (Canada, USA/Mexico, 
S. America/Africa, Europe/Australia, Asia) 
AGIMM Age at immigration (for immigrants) 
*MOVED Changed province indicator (see text)
tEDUC Highest level of education (12 categories) 
*STUDNT Student Indicator 
MACT Main Activity (8 categories) 
*NUMJOB Number of jobs held last year (in Cycle 1) 
RESTR Restriction of Activity Flag 
*CAUSE Main Cause of Restriction (12 categories) 
CONSUL Number of consultations with a Medical 
Doctor 
INHOSP Overnight Hospital Patient Flag 


*CHRONIC Number of Chronic Conditions 


* Indicates the variable was never significant when forming 
classes. 


Figure 3 presents the variables chosen by CHAID to 
build nonresponse adjustment cells for Household Level 
Response and for the Full Longitudinal Response in each 
province. For reasons of confidentiality detail is not given 
on the individual cell sizes and response rates (some of the 
variables used are considered sensitive and are not on the 
PUMFs). However, summary information on the cell 
construction is given in Tables 2 and 3. 


Table 2
Response Adjustment Cell Characteristics
(for Household Level Response)

                          Cell Sizes           Cell % NR rates
Prov.   #Units   #NR   min.   max.   avg.    min.   max.   avg.


Nfld. 1,082 40 354 728 541 1.4 4.8 3.7
1,037 51 Sl 4 7S eS ees On 3:60.49) 
NS. 1,085 55 AGN 374 Alien) Jee lOO 53 
N.B. 1,125 59 32 DSOun 25) ed. O54. 4e eo 
Que. 30005 133) el2 3622 303 neas/50 mamleomaloall ara 
Ont. 4,307 315 44 1,038 308 0.9 25.8 7.3
Man. 1,205 SOME 205 uel 205i le2 0) Seco 41 4.1 
Sask. 1,168 59 S/O 20m meLOy WO selsisch afl 
Alta. 1,544 116 SPL teSS7 ura) So Son LS) 
BiG; 7230149 S22 ROU OMmEc 40m O29 OS 816 


The results vary by province. As expected, provinces 
with larger sample sizes such as Ontario, Quebec, British 
Columbia and Alberta yield “richer” decision trees. Cell 
sizes and response rates also vary considerably. In Table 2,
on household-level response, Manitoba has only one cell,
and 88% of New Brunswick’s sample is located in one cell. 
Likewise, in Table 3 almost all of Newfoundland’s sample 
is placed in one of its two cells. Cell nonresponse rates 
approaching 40% are observed in a few provinces. 


Table 3
Response Adjustment Cell Characteristics
(for Full Longitudinal Response)

                          Cell Sizes           Cell % NR rates
Prov.   #Units   #NR   min.   max.   avg.    min.   max.   avg.


Nfld. 1,082 13 35 1,047 541 C22 2.95, 6.7) 
1,037 80 ANY» ecabsyay PAUL CARIN PX oie Id) 
N.S. 1,085 9623 6 Dues O2me Opa 14.35 68:8 
N.B. eS 86 SOLO mol 48 168 7.6 
Que. 3,000 211 1), Pep, SYS) Pee Sia o TAY, 
Ont. 4,307 470 54 SOLO RIG OO S300 10'9 
Man. 1,205 OLS Ome GS A029 S:68 els ae 726 
Sask. 1,168 83 S089 See O02 S8:9 55 Jet 
Alta. 1,544 148 41 866 172 3 90) 196 


B:C. Wiss I) 33 408 191 AD) 30.3 lel 


Figure 3 shows a variety in the characteristics of 
weighting classes both between provinces and between the 
two types of nonresponse within provinces. In all provinces 
except Alberta the CHAID algorithm uses different 
characteristics for the two nonresponse types as early as at 
the first or second level of branching. A few characteristics 
figure prominently in the early stages of branching in many 
of the trees for both types of nonresponse. They are:
household income adequacy level (INC), income nonresponse
flag (INCNR), Race (RACE) and Place of Birth
(BORN).

In Figure 3a household income and its related variables 
(INCNR and INCSRC), Owner/Renter status (OWNER), 
Race, Place of Birth and Dwelling Type (DWELL) all were 
used three or more times in forming weighting classes for 
Household Level nonresponse. It is also remarked that in 
five out of nine provinces a personal characteristic of the 
Panel Member was selected at the first stage of branching 
by CHAID. This supports the decision to consider personal 
characteristics when adjusting for household level 
nonresponse. 

In Figure 3b for Full Longitudinal nonresponse Census 
Metropolitan Area (CMA), Marital Status (MARIT) and 
SEX, although not as important at first as Income, Race and 
Place of Birth, were used the most often (5 times each). 

As mentioned earlier, design variables such as the 
rejection flag (REJECT) and the “Adult/Children” flag 
(ACFLAG) were considered in an attempt to incorporate 
the sample design in the CHAID analyses. Although 
these variables were selected only once each, household 
characteristics used by the design, such as the presence of 
children (KIDS) and under 25 year-olds (UND25) did get 
selected occasionally. Household size was not used but
Family Type (FAMTYP), which is related to the household
size, did get selected twice.

Figure 3. Provincial Response Classes Obtained for Cycle 2 Nonresponse
(Types of characteristics: design/geographical, dwelling/household, personal)

The adjustment cells produced by CHAID were 
reviewed but only in rare cases were they manually altered. 
Within each cell, the weights of responding units were 
prorated to add up to the total weight for responding and 
nonresponding units. The magnitude of the nonresponse 
weight adjustments never exceeded 1.83. 
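
The prorating step just described can be sketched in Python as follows. This is only a minimal illustration of a weighting-class adjustment, not the NPHS production code: the function name `prorate_weights`, the record layout (`cell`, `weight`, `responded`) and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def prorate_weights(units):
    """Weighting-class nonresponse adjustment (illustrative sketch).

    `units` is a list of dicts with hypothetical keys:
      'cell'      - adjustment cell label
      'weight'    - weight before the nonresponse adjustment
      'responded' - True for respondents, False for nonrespondents
    Returns the respondents with an adjustment factor and adjusted weight.
    """
    total = defaultdict(float)       # weight of all units, by cell
    responding = defaultdict(float)  # weight of respondents, by cell
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            responding[u["cell"]] += u["weight"]

    adjusted = []
    for u in units:
        if not u["responded"]:
            continue  # nonrespondents carry no weight after adjustment
        factor = total[u["cell"]] / responding[u["cell"]]
        adjusted.append({**u, "factor": factor,
                         "adj_weight": u["weight"] * factor})
    return adjusted


# Hypothetical sample: two cells, one nonrespondent in each.
sample = [
    {"cell": "A", "weight": 100.0, "responded": True},
    {"cell": "A", "weight": 150.0, "responded": True},
    {"cell": "A", "weight": 50.0, "responded": False},
    {"cell": "B", "weight": 200.0, "responded": True},
    {"cell": "B", "weight": 100.0, "responded": False},
]
result = prorate_weights(sample)
for r in result:
    print(r["cell"], round(r["factor"], 2), round(r["adj_weight"], 1))
```

In this made-up sample the adjustment factors are 1.2 for cell A and 1.5 for cell B, and the adjusted respondent weights in each cell add up to the cell's total weight of 300.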


4. CONCLUSION 


This paper presented the strategy developed for the 
treatment of both longitudinal and cross-sectional non- 
response to Cycle 2 of the NPHS. The approach adopted 
took into account practical considerations such as the need 
for an easy-to-use, yet statistically valid, way of defining 
weight adjustment cells and the need to provide a more 
useful data set (by having separate adjustments for “Full” 
and “Partial” Longitudinal Responses) while keeping the 
additional effort required at a reasonable level (e.g., by 
using weight adjustment cells for more than one purpose). 
Having chosen the CHAID algorithm approach rather than 
logistic regression allowed us more freedom in the number 
and choice of variables to consider: many design variables 
and personal variables could thus easily be considered — 
and some were retained. This did seem to offer some 
promise about the usefulness of those characteristics in the 
treatment of nonresponse. 

On the other hand, a tight production schedule meant 
that some analysis that we wished to have carried out was 
not performed. It would have been interesting to pursue the
possibilities offered by the CHAID algorithm. For example,
as CHAID allows the use of a categorical response variable,
we could have classified sample units into three groups: live
respondents, dead or out-of-scope units, and nonrespondents.
We would have liked to do our own comparison of
CHAID with a logistic regression approach. We could also 
have attempted to use Health questionnaire variables such 
as the Health Index or Smoker/Drinker status in defining 
weight adjustment cells, although their usefulness would 
have been reduced by the fact that they were not present for 
all units (they are missing in response patterns G*-GH, 
G*-G* and G*-**). Decisions to use the same weight 
adjustment cells for different types of nonresponse should 
be revisited. For example, could the adjustment cells built 
for household level response have been more suitable for 
the Health cross-sectional nonresponse? An attempt to 
compare the efficiency of various nonresponse adjustment 
strategies would involve evaluating their impact on the 
variance of estimators. We could also evaluate the impact 
of our Cycle 2 nonresponse adjustment on the nonresponse 
bias by using the Cycle 1 data available for all panel 
members. Estimates using the full sample would be 
compared to nonresponse-adjusted estimates generated 
from the responding units. 

Cycle 3 itself will present new problems. A global
sample “top-up” is planned in that year, which will have an 
impact on our cross-sectional estimation strategy and 
therefore on the treatment of nonresponse. As longitudinal 
nonresponse is increasing we will have to consider side 
effects of the weighting adjustment such as the possible 
creation of outlier weights. Providing sets of weights for 
different types of longitudinal analyses will become 
cumbersome as the number of “partial” response patterns 
will increase. How many patterns can reasonably be 
treated, and which ones? The choice of additional informa- 
tion, such as Mover status, for the treatment of nonresponse 
should be reconsidered. Some imputation for nonresponse 
will likely be used in Cycle 3: the question is how to 
reconcile imputation with the weight adjustment approach 
to nonresponse. As can be seen, a lot of work remains to be 
done for the NPHS. One hopes that we will have time to 
investigate many of those issues before Cycle 3 processing 
is finished. 


Estimates of the Errors in Classification in the Labour Force Survey
and Their Effect on the Reported Unemployment Rate 


MICHAEL D. SINCLAIR and JOSEPH L. GASTWIRTH¹


ABSTRACT 


This paper studies response errors in the Current Population Survey of the U.S. Bureau of the Census and assesses their 
impact on the unemployment rates published by the Bureau of Labour Statistics. The measurement of these error rates is 
obtained from reinterview data, using an extension of the Hui and Walter (1980) procedure for the evaluation of diagnostic 
tests. Unlike prior studies which assumed that the reconciled reinterview yields the true status, the method estimates the 
error rates in both interviews. Using these estimated error rates, we show that the misclassification in the original survey 
creates a cyclical effect on the reported estimated unemployment rates. In particular, the degree of underestimation increases 
when true unemployment is high. As there was insufficient data to distinguish between a model assuming that the 
misclassification rates are the same throughout the business cycle, and one that allows the error rates to differ in periods 
of low, moderate and high unemployment, our findings should be regarded as preliminary. Nonetheless, they indicate that
the relationship between the models used to assess the accuracy of diagnostic tests, and those measuring misclassification
rates of survey data, deserves further study.


KEY WORDS: Misclassification errors; Unemployment rates; Diagnostic tests; Reconciliation; Reinterview surveys;
Response errors.


1. INTRODUCTION 


Several articles (Poterba and Summers 1986 and 1995;
Abowd and Zellner 1985) used data from the U.S.
Bureau of the Census’ reinterview program to estimate the
misclassification rates of the Current Population Survey
(CPS) and assessed their impact on estimates of labour
market transition rates. The estimated misclassification rates
were based on the assumption that a particular reinterview
method, reconciliation, yields the “truth.” Biemer and
Forsman (1992), Forsman and Schreiner (1991) and
unpublished research of the U.S. Bureau of the Census
(1963) have questioned this assumption. The purpose of
this paper is to provide estimates of the misclassification
rates from response errors in all interviews and reinterviews
and to explore their impact on the reported unemployment
rates. In contrast to the earlier papers, which were concerned
with gross flows, we emphasize the accuracy of the labour
force estimates themselves. Our approach is based on
extending the Hui and Walter (1980) paradigm for
estimating error rates of medical diagnostic tests to trinomial
classifications. An advantage of this method is that no
single interview needs to be considered perfect.

Under certain assumptions, Hui and Walter (1980)
developed a method for estimating the error rates associated
with a new diagnostic screening test, using a confirmatory
test with an unknown low error rate. By treating the
reinterview as the confirmatory test and the original survey
as the screening test, this methodology can be used to
estimate the error rates in the original survey and in the
reinterview, as well as the prevalence rates of the trait
screened for. The
Hui and Walter (1980) method requires two subpopulations
with different prevalence rates of the characteristic. While
the two tests may have different error rates, the error rates
for each test are assumed equal in the two subpopulations.
Furthermore, the model (described in more detail in the
appendix) assumes that the errors from the two tests,
conditioned on the subject’s true status, are independent.
The Hui and Walter method was developed for dichotomous
test outcomes, and was adapted by Sinclair and
Gastwirth (1996) to study misclassification of labour force
participation rates. Here, we extend the approach to account
for three classifications: unemployed, employed and not in
the labour force (NLF), and assess the effect of the
misclassification on the reported unemployment rates. The basic
model is presented in section two. The reinterview program
data, to which the model will be fitted, are described in
section three. The resulting error rates are given in section
four, along with the “adjusted” unemployment rates, which
account for the estimated classification errors. In addition,
a measure of accuracy used in the medical screening
literature, the predictive value, is applied to the unemployment
rate in section four. It shows that the probability that an
individual classified as unemployed in the CPS is actually
unemployed varies with the true level of unemployment.


2. THE DATA AND THE MODEL 


Labour force reinterview data consist of trinomial
responses from both the original survey and a subsequent
reinterview. The data for a given subpopulation and year
are summarized in a 3 x 3 table, where the observed frequency
counts of persons in the table are denoted by $n_{ygij}$. With
this notation:


- y denotes the year;

- g denotes subpopulation membership, g = 1 or 2;

- i denotes the subject’s classification by the original
survey, i = 1 for unemployed, i = 2 for employed and
i = 3 for NLF; and

- j denotes the same subject’s classification by the
reinterview, j = 1, 2 and 3.


We denote the true prevalence rate for each labour force
status, i = 1, 2 and 3, by $\pi_{ygi}$ for subpopulation g and year
y. Throughout this paper, we will use the term prevalence
rate to refer to the proportion of persons in one of the three
labour force categories (e.g., $\pi_{yg1}$). Note that the fraction
of the population in the NLF category equals
$(1 - \pi_{yg1} - \pi_{yg2})$, and that the true unemployment rate in year
y for subpopulation g is equal to $\pi_{yg1}$ divided by
$(\pi_{yg1} + \pi_{yg2})$.

Each classification rate, $\beta_{ygrij}$, is defined as the probability
that the r-th data collection process (r = 1 for the original
survey and r = 2 for the reinterview) will classify a person
in year y from subpopulation g in category i,
i = 1, 2 and 3, when the true status of the individual is
category j. For example, $\beta_{11131}$ denotes the probability that
in the first year (y = 1), a person from the first subpopulation
(g = 1) was classified by the original survey (r = 1) as
NLF (i = 3) when the person’s true status is unemployed
(j = 1). The classification rates can be divided into two
groups: those associated with a correct
classification and those associated with an erroneous
classification. For each y, g and r, the probability that
survey method r classifies a truly unemployed person in
year y from subpopulation g as unemployed is
equal to $\beta_{ygr11} = (1 - \beta_{ygr21} - \beta_{ygr31})$. The corresponding
probabilities for employed and NLF are, respectively,
$\beta_{ygr22} = (1 - \beta_{ygr12} - \beta_{ygr32})$ and $\beta_{ygr33} = (1 - \beta_{ygr13} - \beta_{ygr23})$.
With conditional independence of the original survey and
the reinterview classification rates, the expected observed
frequencies, expressed in terms of the given notation, for
each of the nine cells associated with a particular year y
and subpopulation g are:


Ey) =n, g. Bye 1 Bye Byes) 1 - By go1 By gos) 


+ To Byor12Bygoi2 + 1h - Byes Bygo) Byers Bais) 


E(Ay yy) = Nye TMyes 1-Byg171 Byers) Bygoor + Bygo Byeri2 


+ (1 AAO: yg232) uae ~ Tyo Tyg) Beis B23) 


E (M913) = Nyy Tye 1 Byot 21 Byer 31) Bygast * Byg2 Bye 12Byg039 
oi dl mr | ~The) Beis dl =, Boos B 073) 


E(My gy) = Bye Byer Byei2t 1-By goo Bos) 


+ Myo 1 Bygi 2” Bygi32) Bygora + 1 Mygi Tyga) Byes Byars) 


E(n, oy Ny (Tet Breit By oon 4 Tyg dl Fe Byetia7 Byois2) 


si Bio e232) Fa (1-161 -Tygo) Byei2s By e223) 


EG gs) = g. My gl ¥ ygi2l Byers t Tyg (1~Byet12~ Bygi32) B, 2232 
ris Me 99) Byei23 (1 Bygoi3~ Byeoa3)) 


E(Nyo31) = My Myo Bygi31- Bye By gaa) 


+ (1 Tye 


a TeBye132Pygo12 Tye) Byei2s gi13) Beas) 


E(Ny 3) = Myg Tyg Byes 31 Bygoat + Byg2 Bygi32 (1 - By oi By gaa) 


+ (11g By) 1-Byo1937 Bygi13) Bygors) 


E(n, 33) = yet Th 


g1Pye131Bye031 * MyeBye132B e032 


ce a ~My “Ty a By e1237 Byers) SP yien By e023)» 


where, the total sample size for year y and subpopulation g 
is denoted by n 
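
Under the conditional independence assumption, each of the nine expected frequencies is the sample size times a sum, over the three true statuses, of the prevalence multiplied by the two classification probabilities. The Python sketch below computes the whole 3 x 3 table in this way. It is an illustration only: the function names `full_matrix` and `expected_counts` and all numerical values are hypothetical and are not estimates from the paper.

```python
def full_matrix(errors):
    """Build a 3 x 3 classification matrix from six error rates.

    `errors[(i, j)]` is the probability that a person whose true status
    is j is classified in category i (i != j, both 1-indexed).
    Each diagonal entry is one minus the two errors in its column.
    """
    m = [[0.0] * 3 for _ in range(3)]
    for j in range(1, 4):
        off_diagonal = 0.0
        for i in range(1, 4):
            if i != j:
                m[i - 1][j - 1] = errors[(i, j)]
                off_diagonal += errors[(i, j)]
        m[j - 1][j - 1] = 1.0 - off_diagonal
    return m


def expected_counts(n, pi1, pi2, beta1, beta2):
    """Expected 3 x 3 table: rows = original survey, columns = reinterview.

    beta1 and beta2 are the classification matrices of the original
    survey and the reinterview; the NLF prevalence is 1 - pi1 - pi2.
    """
    pi = [pi1, pi2, 1.0 - pi1 - pi2]
    return [[n * sum(pi[k] * beta1[i][k] * beta2[j][k] for k in range(3))
             for j in range(3)]
            for i in range(3)]


# Hypothetical error rates, keyed by (classified as, true status).
original = {(2, 1): 0.05, (3, 1): 0.10, (1, 2): 0.01,
            (3, 2): 0.02, (1, 3): 0.01, (2, 3): 0.02}
reinterview = {(2, 1): 0.04, (3, 1): 0.09, (1, 2): 0.01,
               (3, 2): 0.02, (1, 3): 0.01, (2, 3): 0.03}

table = expected_counts(1000, 0.05, 0.60,
                        full_matrix(original), full_matrix(reinterview))
for row in table:
    print([round(cell, 1) for cell in row])
```

Because every column of a classification matrix sums to one, the nine expected frequencies add up to the sample size, which gives a quick check on any implementation of these formulas.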

The model has 14 parameters (six error rates for the 
original survey, r = 1, six error rates for the reinterview, 
r=2, and two unique prevalence rates) for each 
subpopulation and year. On the other hand, the 3 x 3 table 
for a given year and subpopulation has only 8 independent 
frequencies, or degrees of freedom. As a result, the model 
is overparameterized and the number of parameters must be
reduced for estimation purposes. The Hui and Walter 
paradigm enables us to accomplish this. 


3. APPLICATION OF THE MODEL AND THE 
CPS REINTERVIEW PROGRAM 


The U.S. Bureau of the Census’ Current Population
Survey Reinterview Program (U.S. Bureau of the Census
1963) is conducted approximately two weeks after the
initial survey to measure response errors and to evaluate
interviewer performance. The sample design for the
reinterview consists of a self-weighting random sample
of households (Levy and Lemeshow 1980) among the
selected interviewer assignments. The sample size is about
1/18 of the monthly CPS sample of 50,000 to 60,000
household interviews. Two reinterview procedures are
conducted. Three-fourths to four-fifths of the sample cases
participate in a response-bias study. Here, an initial
reinterview is conducted and, after this interview is
completed, the reinterviewer reconciles with the respondent
any disagreements between the original and the initial
reinterview responses. Hence, in the response-bias study,
up to two reinterview responses may be obtained from each
subject: the first unreconciled reinterview response and a
reconciled reinterview response. The remaining one-fifth to
one-fourth of the sample households receive a reinterview
without reconciliation.

In the response-bias study, the reinterviewer is instructed
not to look at the original survey responses until the initial
reinterview is completed. Forsman and Schreiner (1991)
and Schreiner (1980) suggested that the reinterviewers may
change the initial reinterview responses to match the
original response, as they observed that the rate of
disagreement between the original responses and the initial
reinterview responses was greater in the unreconciled
sample. Sinclair (1994) and Sinclair and Gastwirth (1996)
showed that these differences were statistically significant.
As a result, the reconciliation process creates a correlation
between the original and unreconciled reinterview responses
in the reconciled sample. Hence, we decided to
limit our analysis to the original and unreconciled reinterview
data from the unreconciled study sample. For the
purposes of this study, we will assume that, in the unreconciled
sample, the errors from the original survey and the
unreconciled reinterview, conditioned on the respondent’s
true status, are independent.

To apply the Hui and Walter approach, one needs two
subpopulations with different prevalence rates. As males
and females are known to have different labour force
participation rates, we use them. We also need to assume that the
classification error rates are equal in the two subpopulations,
males and females, i.e., $\beta_{y1rij} = \beta_{y2rij}$. At this stage, we
assume that the classification error rates for the original
survey and the unreconciled reinterview may be different,
and that they may differ by year. With this reduction, for the
two subpopulations in a given year, we now have a total of
12 error rate parameters and 4 prevalence rates, yielding 16
parameters. Since two 3 x 3 tables contain a total of 16
degrees of freedom, estimation is possible. In this paper, we
have analyzed the CPS unreconciled reinterview sample
data for the period 1981 through 1990. Complete yearly
data for 1987, as well as more recent data, were not available
from the U.S. Bureau of the Census.
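
The parameter counting behind this identifiability argument can be checked with a few lines of Python. This is a small bookkeeping sketch; the function `count_parameters` and its arguments are hypothetical names introduced for the example.

```python
def count_parameters(n_groups, n_categories=3, shared_errors=True):
    """Return (number of parameters, degrees of freedom) for one year.

    n_groups      - number of subpopulations, each with its own table
    shared_errors - True if error rates are assumed equal across groups
    """
    processes = 2  # original survey and reinterview
    errors_per_process = n_categories * (n_categories - 1)  # 6 when 3 categories
    error_sets = 1 if shared_errors else n_groups
    n_error = error_sets * processes * errors_per_process
    n_prevalence = n_groups * (n_categories - 1)  # prevalences sum to one
    dof = n_groups * (n_categories ** 2 - 1)      # cells minus one, per table
    return n_error + n_prevalence, dof


# One subpopulation on its own: more parameters than degrees of freedom.
print(count_parameters(1))  # (14, 8)

# Two subpopulations with shared error rates: just identified.
print(count_parameters(2))  # (16, 16)
```

With a single table the model has 14 parameters against 8 degrees of freedom, whereas pooling two subpopulations under equal error rates gives 16 of each, which is why the second subpopulation is needed.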

The CPS estimates of the unemployment rate are
published regularly by the Bureau of Labour Statistics (BLS)
(see Bureau of Labour Statistics 1992). Since the reinterview
is a sub-sample of the full CPS sample, the original survey
estimates of the unemployment rate from the reinterview
sample will differ from the BLS published results. Data
processing procedures are used on the full CPS sample that
are not applied to the reinterview data. For example, the full
CPS sample is weighted, based on the sample selection
probabilities, and nonresponse adjustment factors are applied
to the data. Given these differences, the estimated prevalences
from our model, based solely on the reinterview data,
are not directly comparable to the BLS reported values. We
have used the CPS reinterview data primarily to estimate
the error rates in the original survey. Furthermore, we have
treated the unreconciled reinterview data as a simple random
sample of the population for analysis and hypothesis testing
purposes throughout this paper. Using these error rate
estimates, we estimate adjusted BLS
unemployment rates, where the term adjusted means
that the reported values have been modified to account for
the misclassification in the survey. The formula for estimating
the true unemployment rate as a function of the
reported BLS prevalences from the full CPS sample and the
estimated classification error rates obtained from the
unreconciled reinterview data is given in the appendix.


4. DATA ANALYSIS AND RESULTS 


The first step in preparing our final estimates was to
obtain the parameter estimates for each of nine yearly data
tables, using the SAS NLIN procedure with the Gauss-Newton
weighted least squares method. As the reinterview
procedures remained constant during the period, we decided
to test the hypothesis that each of the error rates remained
equal across the years studied, i.e., $\beta_{ygrij} = \beta_{y'grij}$ for all
years $y \neq y'$. In conjunction with the basic assumption that
the error rates for males and females are equal, i.e.,
$\beta_{y1rij} = \beta_{y2rij}$, this implies $\beta_{ygrij} = \beta_{y'g'rij}$ for all $y \neq y'$ and
$g \neq g'$.

From the two sets of results, we conducted a likelihood
ratio test, under the assumption that the reinterview sample
is a simple random sample of the population, to test the
assumption that each of the error rates was the same for all
years. The likelihood ratio statistic, $-2 \log \lambda$, with 96
degrees of freedom (144 parameters in the full model less 48
parameters in the reduced model), yielded a value of 84.06
with a p-value of 0.8027. Hence, the data are consistent with
the reduced model, enabling us to use the reduced model
estimates and to simplify the notation. We will now use $\beta_{rij}$
to denote $\beta_{ygrij}$ for all g and y.
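
The degrees of freedom and the p-value quoted above can be roughly reproduced as follows. The sketch makes two assumptions that are not stated in the text: the 48 reduced-model parameters are taken to be 12 error rates shared across years plus 4 prevalences for each of the 9 years, and the chi-square tail probability is computed with the Wilson-Hilferty normal approximation (chosen here only to keep the example within the Python standard library) rather than exactly.

```python
import math


def chi2_sf_wilson_hilferty(x, k):
    """Approximate P(X > x) for X ~ chi-square with k degrees of freedom.

    Uses the Wilson-Hilferty cube-root normal approximation, which works
    well for large k; it is not an exact chi-square tail probability.
    """
    mean = 1.0 - 2.0 / (9.0 * k)
    sd = math.sqrt(2.0 / (9.0 * k))
    z = ((x / k) ** (1.0 / 3.0) - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)


N_YEARS = 9       # 1981 through 1990, without 1987
ERROR_RATES = 12  # 6 per data collection process, 2 processes
PREVALENCES = 4   # 2 per subpopulation, 2 subpopulations

full_params = N_YEARS * (ERROR_RATES + PREVALENCES)   # error rates vary by year
reduced_params = ERROR_RATES + N_YEARS * PREVALENCES  # error rates shared
df = full_params - reduced_params

p_value = chi2_sf_wilson_hilferty(84.06, df)
print(full_params, reduced_params, df, round(p_value, 2))
```

The approximation gives a p-value of about 0.80, in line with the reported 0.8027; an exact chi-square routine would be preferable in real work.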

The estimated error rates for the original survey and for
the unreconciled reinterview are presented in Tables 1 and
2, respectively, with their estimated standard errors. The
estimated reinterview error rates in Table 2 are similar to the
corresponding error rate estimates for the original survey.
This similarity indicates that the U.S. Bureau of the Census
unreconciled reinterview serves as an effective replication.
The error rate estimates show that the CPS survey
procedures are able to classify the employed, and those not
in the labour force, quite accurately. On the other hand,
these procedures do not perform well for classifying the
unemployed, as the proportion of truly unemployed persons
who are classified as unemployed, $(1 - \beta_{121} - \beta_{131})$, is only
0.8397.

For comparative purposes we conducted an analysis of
the 75% sample reconciled reinterview data for the same
1981-1990 period, under the assumption that the reconciled
responses were error-free. We created a 3 x 3 table for the
number of persons classified by the original interview in
each labour force category, by the number of persons
classified by the reconciled reinterview in each labour force
category. The data are given in Table 3. The table frequencies
report aggregate data, by year and sex, so that the error rates
derived from this table are comparable to our model. Using
the column status as the true status, one computes an
estimate of the error rates. For example, the estimate of $\beta_{121}$,
the probability that an unemployed person will be classified
in the original survey as employed, is 332/17,681 = 0.0188.
These error rates are presented in Table 1 to illustrate how
the estimated error rates from our method, based on the
unreconciled data, differ from those relying on the
assumption that the reconciled reinterview is perfect.
Table 1 also presents the estimates of the original survey 
error rates, as obtained by Poterba and Summers (1986), 
using reinterview data (combined for both sexes) for the first 
half of 1981. The Poterba and Summers’ method uses both 
the data from the unreconciled and reconciled samples to 
estimate the error rates. These authors assume that in the 
reconciled sample, the interviewers use the original survey 
data provided, to influence the initial reinterview response. 
As a result, they assume that a reconciled value is only 
obtained for a portion of persons, that should have had a 


discrepancy between the original survey and the initial 
reinterview. When areconciled value is obtained, Poterba 
and Summer assume that the reconciled data is error-free. 
With these assumptions, they use the unreconciled sample to 
estimate the incidence of the error, and the reconciled data to 
provide the information on the true labour force status. In 
summary, both the Poterba and Summers method, and the 
reconciled reinterview estimates, rely on the reconciled 
reinterview data being perfect. 
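
The reconciled-reinterview error rates discussed above are simple column proportions of the Table 3 cross-tabulation. The following is a minimal sketch of that arithmetic, not the authors' code. It assumes that the cell for persons classified as unemployed in the original survey and as employed in the reconciled reinterview is 372, which is the value implied by the printed row total of 16,720; the function name `reconciled_error_rates` is hypothetical.

```python
# Rows: original CPS classification; columns: reconciled reinterview
# classification, treated here as the true status.
# Category order: unemployed (U), employed (E), not in labour force (NLF).
# Assumption: the U/E cell is 372, the value implied by the row total 16,720.
counts = [
    [15868, 372, 480],      # original survey: unemployed
    [332, 213987, 744],     # original survey: employed
    [1481, 2123, 138077],   # original survey: NLF
]
labels = ["U", "E", "NLF"]


def reconciled_error_rates(table):
    """Return rates[i][j]: share of column j (true status) classified as row i."""
    size = len(table)
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    return [[table[i][j] / col_totals[j] for j in range(size)]
            for i in range(size)]


rates = reconciled_error_rates(counts)
for j, true_status in enumerate(labels):
    for i, classified in enumerate(labels):
        if i != j:  # off-diagonal cells are the misclassification rates
            print(f"classified {classified}, true {true_status}: "
                  f"{rates[i][j]:.4f}")
```

Each off-diagonal rate is the count in that cell divided by its column total, so the six rates can be compared directly with the reconciled-data column of Table 1.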

Table 4 presents the reported BLS yearly unemployment
rates among those in the labour force, for males and females
combined, in comparison to the estimated adjusted
unemployment rates based on: (1) our error rate estimates, (2)
Poterba and Summers (1986) error rates, and (3) error rates
assuming the reconciled reinterview is perfect. If the results
in Table 4 are sorted by the value of the BLS reported
unemployment rate, an apparent trend is observed in the bias in
the original CPS estimates. Figure 1 shows that the reported
values tend to overestimate the actual unemployment rate of
persons in the labour force in low unemployment years
(1989, 1988 and 1990), and to underestimate the
unemployment rate in high unemployment years (1982-1983).
Furthermore, the bias associated with our method is shifted
upward from the two other approaches. All three methods
indicate a cyclical effect, the smallest of which is obtained
when the reconciled reinterview is assumed perfect.


Table 1
Estimated Error Rates in the Original CPS Estimates

Error Rate     Description                    Estimated Value of $\beta_{1ij}$                   Estimated
Parameter      Classified as   True Status    Our Method   P&S (1986)   Reconciled Data Perfect   Standard Error
$\beta_{121}$  Employed        Unemployed     0.0407       0.0378       0.0188                    0.01892
$\beta_{131}$  NLF             Unemployed     0.1196       0.1146       0.0838                    0.01463
$\beta_{112}$  Unemployed      Employed       0.0049       0.0054       0.0017                    0.00124
$\beta_{132}$  NLF             Employed       0.0100       0.0172       0.0098                    0.00154
$\beta_{113}$  Unemployed      NLF            0.0110       0.0064       0.0034                    0.00155
$\beta_{123}$  Employed        NLF            0.0205       0.0116       0.0053                    0.00247

Table 2
Estimated Error Rates in the Unreconciled Reinterview CPS Estimates

Error Rate     Description                    Estimated Value   Estimated
Parameter      Classified as   True Status    Our Method        Standard Error
$\beta_{221}$  Employed        Unemployed     0.0333            0.01772
$\beta_{231}$  NLF             Unemployed     0.1128            0.01360
$\beta_{212}$  Unemployed      Employed       0.0057            0.00135
$\beta_{232}$  NLF             Employed       0.0145            0.00160
$\beta_{213}$  Unemployed      NLF            0.0157            0.00171
$\beta_{223}$  Employed        NLF            0.0248            0.00238

Table 3
Cross-tabulation of the Aggregated 1981-1990 Original/Reconciled Reinterview Responses
75% Reconciled CPS Reinterview Data

Survey Result     Reconciled Reinterview
Original CPS      Unemployed   Employed   NLF       Total
Unemployed        15,868       372        480       16,720
Employed          332          213,987    744       215,063
NLF               1,481        2,123      138,077   141,681
Total             17,681       216,482    139,301   373,464

Table 4
Implications of the Error Rate Estimates

Year y   BLS Reported    Prob. Unemp.   Adjusted Estimate of BLS Reported                   Difference in   Estimated
         Unemployment    Given          Unemployment Rate $AUE_y^{BLS}$                     Reported vs.    Standard Error
         Rate            Classified     Our Method   Poterba and      Reconciled Data       Adjusted        in Difference
         $UE_y^{BLS}$    Unemp.                      Summers (1986)   (1981-1990) Perfect   (Our Method)    (Our Method)
1990     5.44%           0.8135         5.27%        5.36%            5.63%                  0.17%          0.27%
1989     5.20%           0.8052         4.99%        5.09%            5.37%                  0.21%          0.26%
1988     5.43%           0.8113         5.25%        5.35%            5.62%                  0.18%          0.27%
1986     6.89%           0.8503         6.97%        7.04%            7.22%                 -0.08%          0.33%
1985     7.09%           0.8531         7.20%        7.27%            7.44%                 -0.11%          0.34%
1984     7.41%           0.8581         7.56%        7.63%            7.79%                 -0.15%          0.36%
1983     9.47%           0.8894         9.99%        10.00%           10.04%                -0.52%          0.48%
1982     9.54%           0.8902         10.08%       10.09%           10.12%                -0.54%          0.49%
1981     7.50%           0.8581         7.66%        7.72%            7.88%                 -0.16%          0.36%

Figure 1. A Comparison of the Bias in the Reported Unemployment Rates as Computed Using Three Methods

In the screening test literature (Gastwirth 1987), the
fraction of positive classifications which are correct, called
the predictive value of a positive test, is known to vary
directly with the prevalence of the characteristic. This is
why quite accurate diagnostic tests can have unacceptably
high misclassification rates when populations with a low
prevalence of a disease are screened with them. The analog
of this measure in our context is the proportion of
individuals classified as unemployed who are truly unemployed.
This proportion is given in the third column of Table 4. Even
though the range of reported unemployment rates is fairly
narrow, a similar relationship with the unemployment rate
can be seen.
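
The dependence of the predictive value on prevalence can be illustrated with a short calculation. The sketch below is illustrative only: it applies Bayes' rule using the "Our Method" error rates from Table 1, while the true prevalences passed to the function are hypothetical values chosen for the example, and the function name is not taken from the paper.

```python
# Error rates from Table 1 ("Our Method"); keys are the subscripts r, i, j.
BETA = {"121": 0.0407, "131": 0.1196, "112": 0.0049, "113": 0.0110}


def predictive_value_unemployed(pi_u, pi_e, beta):
    """P(truly unemployed | classified unemployed) by Bayes' rule.

    pi_u, pi_e: true proportions unemployed and employed (hypothetical inputs);
    the remainder, 1 - pi_u - pi_e, is not in the labour force.
    """
    pi_n = 1.0 - pi_u - pi_e
    correct = pi_u * (1.0 - beta["121"] - beta["131"])
    false_from_employed = pi_e * beta["112"]
    false_from_nlf = pi_n * beta["113"]
    return correct / (correct + false_from_employed + false_from_nlf)


# Hypothetical true unemployed proportions, with the employed share held fixed.
for pi_u in (0.03, 0.05, 0.07):
    value = predictive_value_unemployed(pi_u, 0.63, BETA)
    print(f"pi_u = {pi_u:.2f}: predictive value = {value:.4f}")
```

With the error rates held fixed, the predictive value rises as the true proportion unemployed rises, which is the pattern seen in the third column of Table 4.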

While the results of the likelihood ratio test indicated that
the error rates were constant throughout the period, the
referees suggested a further analysis to explore this
assumption. We divided the nine survey years into three
groups, according to each year's reported unemployment rate.
Survey years 1990, 1989 and 1988 were classified as having
low unemployment, with reported rates from 5.20% to
5.44%. Similarly, survey years 1982 and 1983 were
classified as having high unemployment, with reported rates
of 9.54% and 9.47%, respectively. The remaining years, with
rates ranging from 6.89% to 7.50%, were classified as having
moderate unemployment rates. With this three-group
structure, we developed an alternative model that assumed
that the error rates were constant within each of the three
groups, but allowed each of these groups to have
different error rates. The estimated error rates for the
original interview are presented in Table 5. The error rates
from Table 1, using the equal error rate model, are presented
for comparative purposes.

We conducted a likelihood ratio test of the
assumption that each of the error rates was the same within
each of these three groups, in comparison to the initial
nine-year model. The likelihood ratio statistic, $-2 \log \Lambda$, with 72
degrees of freedom (144 parameters in the full model less 72
parameters in the three-group model), yielded a value of
69.25 with a p-value of 0.5697.
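
The reported p-value is the upper tail of a chi-square distribution with 72 degrees of freedom evaluated at 69.25. A minimal sketch of that check follows. It uses only the standard library and a series expansion of the regularized lower incomplete gamma function; the helper name `chi2_sf` is hypothetical and this is not the software the authors used.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df
    degrees of freedom, via the series for the regularized lower
    incomplete gamma function P(a, z) with a = df / 2 and z = x / 2."""
    a = df / 2.0
    z = x / 2.0
    if z <= 0.0:
        return 1.0
    # Series: sum over n >= 0 of z^n / (a (a + 1) ... (a + n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    log_prefix = -z + a * math.log(z) - math.lgamma(a)
    lower = total * math.exp(log_prefix)
    return 1.0 - lower


# 144 parameters in the full model less 72 in the three-group model
degrees_of_freedom = 144 - 72
p_value = chi2_sf(69.25, degrees_of_freedom)
print(f"df = {degrees_of_freedom}, p-value = {p_value:.4f}")
```

A p-value this large gives no evidence against the hypothesis that the error rates are constant within the three groups.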

In general, the error rate estimates for the three
unemployment rate classes appear to be similar. Because the
standard errors of the estimated error rates are quite large, a
formal homogeneity test would have insufficient power to
detect any variation in an error rate over the three periods.

To assess the sensitivity of the adjusted unemployment
rate estimates in Table 4, we recomputed them using the
error rates from the three-group model. The results are given
in Table 6, which also provides the standard errors of the
unemployment rate estimates, ranging from a low of about
1.4% to a high of about 2.6%.

Figure 2 presents a graph of the bias in the unemployment
rate using the three-group model and, for comparison, the
original equal error rate model. The results in Figure 2 are
quite interesting. While the cyclical effect is still apparent,
the estimated bias is shifted downward and shows a
consistent negative bias throughout the business cycle.


Table 5
Error Rates in the Original CPS Data Estimated for Three Unemployment Rate Classes

                                              Error Rate Estimates
                                              Model in Table 1      Estimates Using Three Group Model
                                              Assumes Constant      Low Years           Moderate Years      High Years
Error Rate     Description                    Error Rates           1990, 1989,         1981, 1984-1986     1982, 1983
Parameter      Classified as   True Status    Across Years          & 1988
                                              Est.      STE         Est.      STE       Est.      STE       Est.      STE
$\beta_{121}$  Employed        Unemployed     0.0407    0.0189      0.0635    0.1061    0.1113    0.1258    0.0974    0.0717
$\beta_{131}$  NLF             Unemployed     0.1196    0.0146      0.1680    0.0538    0.1000    0.0246    0.1084    0.0221
$\beta_{112}$  Unemployed      Employed       0.0049    0.0012      0.0000    0.0047    0.0000    0.0098    0.0000    0.0069
$\beta_{132}$  NLF             Employed       0.0100    0.0015      0.0080    0.0038    0.0096    0.0025    0.0096    0.0031
$\beta_{113}$  Unemployed      NLF            0.0110    0.0015      0.0096    0.0040    0.0109    0.0024    0.0103    0.0029
$\beta_{123}$  Employed        NLF            0.0205    0.0025      0.0187    0.0065    0.0202    0.0034    0.0227    0.0044

Table 6
Implications of the Error Rate Estimates Using Three Group Model

Year y   BLS Reported   Prob. Unemp.      Adjusted Estimate of BLS          Difference in Reported            Estimated Standard
         Unemployment   Given Classified  Reported Unemployment Rate        vs. Adjusted                      Error of the
         Rate           Unemp., Three     Original Equal    Three Group     Original Equal    Three Group     Difference, Three
                        Group Model       Error Rate Model  Model           Error Rate Model  Model           Group Method
1990     5.44%          0.9124            5.27%             6.43%            0.17%            -0.99%          1.40%
1989     5.20%          0.9088            4.99%             6.12%            0.21%            -0.93%          1.35%
1988     5.43%          0.9105            5.25%             6.41%            0.18%            -0.98%          1.41%
1986     6.89%          0.9170            6.97%             8.01%           -0.08%            -1.12%          2.35%
1985     7.09%          0.9178            7.20%             8.25%           -0.11%            -1.16%          2.42%
1984     7.41%          0.9199            7.56%             8.64%           -0.15%            -1.23%          2.53%
1983     9.47%          0.9400            9.99%             11.18%          -0.52%            -1.71%          2.05%
1982     9.54%          0.9404            10.08%            11.27%          -0.54%            -1.73%          2.08%
1981     7.50%          0.9191            7.66%             8.74%           -0.16%            -1.24%          2.56%

Figure 2. A Comparison of the Bias in the Reported Unemployment Rates as Computed Using the Equal Error Rate Model
and the Three Group Model


5. IMPLICATIONS OF THE ADJUSTED
ESTIMATES


The results in Figures 1 and 2 show that all methods for
adjusting the unemployment rate for misclassification error
indicate that the degree of bias in the reported rate varies
over the business cycle. Given the differences in the
estimated bias yielded by the two approaches, it is difficult
to determine the magnitude of the bias. Unfortunately, the
estimates are sensitive to the model specification, due to the
small unreconciled reinterview sample size. This is reflected
in the large standard errors of the estimated error rates and,
consequently, of the estimated bias.

Our approach, using the assumption that the error rates
remained constant throughout, suggests that bias in the
survey estimates is small in years when the unemployment
rate is between 5.5% and 7.5%. With this model, the
reported unemployment rate appears to be unbiased when
the true unemployment rate is around 6.3%; it is an
underestimate when the true rate is above this level, and an
overestimate when the true rate is below it. The
underestimation bias becomes quite noticeable when unemployment
reaches 9%, while the overestimation bias could be
meaningful when unemployment is less than 5%.

The three-group model results imply that the
reported unemployment rates are underestimates. If the
finding is accurate, these results show that the bias in low
unemployment years is still about -0.7%, but can be as high
as -1.7% in high unemployment years. This contrasts with the
results obtained from the equal error rate model.

The fact that both the magnitude and direction of the bias
in the reported unemployment rate change over the business
cycle may affect the use of that rate in studies of the "natural
rate" of unemployment, and of the trade-off between inflation
and unemployment. Specifically, our results indicate that
the range of the true unemployment rate over the business
cycle is larger than the range of the reported rate (see
Table 4). Hughes and Perlman (1984) survey the literature
on the "natural rate" of unemployment and the trade-off
between inflation and unemployment, as well as the role of
search theory in explaining why unemployment is not that
low at "full" employment. McKenna (1985) provides a more
advanced treatment of job search theory, its relationship
to the duration of unemployment, and the degree to which
unemployment is voluntary. Resolving the issue of which
model underlies the misclassification error rates in the CPS
survey has important economic implications. If the equal
error rate model were correct, then in periods of low
unemployment the reported rate would be a slight overestimate. Hence,
there would be less true unemployment to explain by job
search and related theories. On the other hand, if the
three-group model is the correct one, then even at low levels of
reported unemployment, there are more persons really
unemployed.


6. DISCUSSION 


In this paper, we have presented an alternative method for
estimating the error rates in the CPS survey. Our study
differs from prior work in that we follow the Hui and Walter
(1980) approach to estimate the error rates, assuming that
males and females have the same error rates, and that the
errors in the original survey are independent of those in the
unreconciled reinterview. While the errors could be slightly
correlated, the assumption of independence is standard in
data analysis of this type (see Bailar 1968; Chua and Fuller
1987; and Singh and Rao 1995). A discussion of the bias in
the H&W method with dependent errors is given in Vacek
(1985). As for the equal error rate assumption, several of the
authors cited in this paper (e.g., Poterba and Summers 1986)
have noted minor to moderate differences in the error rates
between males and females, under the assumption that the
reconciled reinterview is perfect. However, this assumption
has been questioned. For example, consider the estimate of
$\beta_{121}$, the probability that an unemployed person will be
classified in the original survey as employed. From Table 3,
we estimate this value, under the assumption that the
reconciled reinterview is unbiased, by dividing $n_{21}$
by $n_{\cdot 1}$ (332/17,681 = 0.0188), where $n_{ij}$ is defined
previously, with $j$ now corresponding to the classification status
in the reconciled reinterview. Using the expected value of
these two frequencies from section 2, we can write an
expression for the expectation of the estimate in large
samples as follows:


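$$E\left(\frac{n_{21}}{n_{\cdot 1}}\right)
= \frac{\pi_1 \beta_{121}(1-\beta_{321}-\beta_{331}) + \pi_2(1-\beta_{112}-\beta_{132})\beta_{312} + (1-\pi_1-\pi_2)\beta_{123}\beta_{313}}
       {\pi_1(1-\beta_{321}-\beta_{331}) + \pi_2\beta_{312} + (1-\pi_1-\pi_2)\beta_{313}}$$

$$= \beta_{121}\left[\frac{\pi_1(1-\beta_{321}-\beta_{331})}
       {\pi_1(1-\beta_{321}-\beta_{331}) + \pi_2\beta_{312} + (1-\pi_1-\pi_2)\beta_{313}}\right]
  + \left[\frac{\pi_2(1-\beta_{112}-\beta_{132})\beta_{312} + (1-\pi_1-\pi_2)\beta_{123}\beta_{313}}
       {\pi_1(1-\beta_{321}-\beta_{331}) + \pi_2\beta_{312} + (1-\pi_1-\pi_2)\beta_{313}}\right]
\qquad (1)$$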


From (1) it follows that, if the reconciled reinterview
error rates, $\beta_{3ij}$, are equal to zero, this estimator is
unbiased. However, if the reconciled reinterview is not
perfect, then the bias in the estimator depends on the
prevalence rates in the population studied. As a result, if the
actual original survey error rates are in fact equal in the two
subpopulations studied, and the reconciled survey
classifications are not perfect, the estimated original survey
error rates for the two populations will differ. Therefore,
one cannot use the similarities or differences in the
estimated error rates for males and females from earlier
papers to justify or to contradict the assumptions used
here.
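
The argument can be illustrated numerically. The sketch below evaluates the large-sample expectation in (1) for two subpopulations that share the same original-survey error rates but differ in prevalence. The original-survey rates are the "Our Method" values from Table 1; the reconciled-reinterview error rates and the prevalences are hypothetical values chosen only for illustration, and the function name is not from the paper.

```python
def expected_reconciled_estimate(pi_u, pi_e, b1, b3):
    """Large-sample value of n_21 / n_.1, following expression (1).

    pi_u, pi_e: true proportions unemployed and employed.
    b1: original-survey error rates, keys "121", "112", "132", "123".
    b3: reconciled-reinterview error rates, keys "321", "331", "312", "313".
    """
    pi_n = 1.0 - pi_u - pi_e
    # Probability that the reconciled reinterview classifies a truly
    # unemployed person as unemployed
    keep_u = 1.0 - b3["321"] - b3["331"]
    numerator = (pi_u * b1["121"] * keep_u
                 + pi_e * (1.0 - b1["112"] - b1["132"]) * b3["312"]
                 + pi_n * b1["123"] * b3["313"])
    denominator = pi_u * keep_u + pi_e * b3["312"] + pi_n * b3["313"]
    return numerator / denominator


# Original-survey error rates from Table 1 ("Our Method")
B1 = {"121": 0.0407, "112": 0.0049, "132": 0.0100, "123": 0.0205}
# Hypothetical reconciled-reinterview error rates (assumptions)
B3_PERFECT = {"321": 0.0, "331": 0.0, "312": 0.0, "313": 0.0}
B3_IMPERFECT = {"321": 0.02, "331": 0.05, "312": 0.002, "313": 0.005}

# Hypothetical prevalences for two subpopulations (assumptions)
group_a = (0.05, 0.70)
group_b = (0.04, 0.50)

for name, b3 in (("perfect", B3_PERFECT), ("imperfect", B3_IMPERFECT)):
    est_a = expected_reconciled_estimate(*group_a, B1, b3)
    est_b = expected_reconciled_estimate(*group_b, B1, b3)
    print(f"{name} reconciliation: group A {est_a:.4f}, group B {est_b:.4f}")
```

When the reconciled reinterview is error-free, both groups return the common value of $\beta_{121}$; once it is imperfect, the two groups yield different values even though their true original-survey error rates are identical.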

We have also conducted a sensitivity analysis of the Hui
and Walter (1980) method for dichotomous responses
(Sinclair 1994), which indicates that the procedure is sensitive
to a violation of the equal error rate assumption in some
circumstances, but quite robust in others.
Further research is needed to develop reinterview procedures
and analytical techniques that relax the restrictive
assumptions currently required in the analysis of the
reinterview data.

It should be noted that Chua and Fuller (1987) also
obtained estimates of the 3-outcome classification errors in
the 1977-1980 CPS 25% sample reinterview data. Analogous
to our results, their study found that the largest error
rates were associated with classifying the truly unemployed.
Poterba and Summers (1995) and Singh and Rao (1995) also
found this group to be the hardest to classify. Because all
models examined indicated that the overall misclassification
rate of an unemployed individual is around 20%, future
reinterviews might focus on understanding why these rates
are so high. Hopefully, this will lead to an improved survey.

A potential use of the "adjusted" estimates in Table 4 is
in a sensitivity analysis of the literature (e.g., Abowd and
Zellner 1985; Poterba and Summers 1995) on gross flows
and labour market dynamics, which assumed that the
reconciled interview was perfect. This is equivalent to their
adoption of the estimates in the next to the last column of
Table 3. Similarly, estimates of the classification errors may
be incorporated in procedures for estimating probit and logit
models with misclassified response variables (Hausman and
Morton 1994), and in the development of formal statistical
procedures for survey data (Rao and Thomas 1991). It
should be emphasized that all the estimates adjusting for
misclassification are still in the research phase, and that the
error rates are not yet estimated with sufficient accuracy to
adjust the regular survey data, especially as a new
questionnaire and new interviewing procedures were introduced as
of January 1994 (Bureau of Labour Statistics 1993).


TECHNICAL APPENDIX A


A Review of the Hui and Walter Method


The Hui and Walter method was developed for the
evaluation of diagnostic tests. The advantage of the
technique is that it allows the researcher to measure the error
rate in a given test without requiring the comparison test to
be error-free. To accomplish this task, the procedure uses
two populations (or subpopulations) with different
prevalences to estimate the parameters. The data from such a
study can be summarized in a 2 x 2 table, as given in Figure
A below. This table, for a specific subpopulation, is indexed
by the letter $g$. We will denote by $n_{gij}$ the frequency of cases from
subpopulation $g$ that have a classification from the first test
of status $i$ ($i = 1$ for those having the trait, and $i = 2$ for
those not having the trait), and from the second test of status
$j$ ($j = 1$ or $2$). Let $\pi_g$ denote the true unknown
prevalence rate of the trait, and let $\alpha_{rg}$ and $\beta_{rg}$ denote the unknown
false positive and false negative rates. These error rates are
indexed by the letter $r$, where $r = 1$ corresponds to the
outcome from the first test, and $r = 2$ to the second test
(in our context, $r = 1$ corresponds to the original
interview, and $r = 2$ to a reinterview). The false positive
rate, $\alpha_{rg}$, is the probability that the evaluation from the
$r$-th test will classify the person as positive when in truth the
person should have been classified as negative. Similarly,
the false negative rate, $\beta_{rg}$, is the probability that the evaluation
from the $r$-th test will classify the case as negative when the
case has the trait. One (1) minus each of these parameters
gives the specificity and sensitivity of the test (or
survey) classification procedures, respectively.


Test 1 Outcome        Test 2 Outcome (Reinterview)
(Original Survey)     Positive          Negative          Total
Positive              Cell 1            Cell 3            $n_{g1\cdot}$
Negative              Cell 2            Cell 4            $n_{g2\cdot}$
Total                 $n_{g\cdot 1}$    $n_{g\cdot 2}$    $n_{g\cdot\cdot}$


Figure A. Cross-classification of Test 1 and Test 2 Outcomes


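Assuming the errors of the first and second tests are
independent of each other (given the true state), the expected
probabilities, denoted by $P_{gij}$, associated with the cell
frequencies given in Figure A, for a given subpopulation $g$,
are as follows:

$$\begin{aligned}
\text{Cell 1:}\quad P_{g11} &= \pi_g(1-\beta_{1g})(1-\beta_{2g}) + (1-\pi_g)\,\alpha_{1g}\alpha_{2g} \\
\text{Cell 2:}\quad P_{g21} &= \pi_g\,\beta_{1g}(1-\beta_{2g}) + (1-\pi_g)(1-\alpha_{1g})\,\alpha_{2g} \\
\text{Cell 3:}\quad P_{g12} &= \pi_g(1-\beta_{1g})\,\beta_{2g} + (1-\pi_g)\,\alpha_{1g}(1-\alpha_{2g}) \\
\text{Cell 4:}\quad P_{g22} &= \pi_g\,\beta_{1g}\beta_{2g} + (1-\pi_g)(1-\alpha_{1g})(1-\alpha_{2g})
\end{aligned}
\qquad (A.1)$$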


From (A.1), we observe that we have a total of five
parameters, but only three independent cell entries (or
degrees of freedom) from which to estimate them.
Therefore, the number of parameters must be reduced.

To reduce the parameters, Hui and Walter first assume
that the proportion of cases with the trait differs by
subpopulation, which implies that $\pi_1 \neq \pi_2$. Secondly, they
require that two subpopulations can be found such that the
error rates for each test are the same for both subpopulations.
The error rates associated with the two tests are allowed to
differ. For two subpopulations, this implies that in (A.1),
$\beta_r = \beta_{r1} = \beta_{r2}$ and $\alpha_r = \alpha_{r1} = \alpha_{r2}$, with $\beta_1 \neq \beta_2$ and
$\alpha_1 \neq \alpha_2$. Under these conditions, the number of parameters
reduces to six (two prevalence rates, one for each
subpopulation, and two error rates each for test 1 and test 2).
Given that the two 2 x 2 tables contain six degrees of
freedom, estimation is possible. Notice that if $\pi_1 = \pi_2$ and
the error rates were the same in both subpopulations, then the
probabilities in (A.1) would be the same for both
subpopulations, so we would really have one table, and
estimation would not be possible. Weighted nonlinear least
squares estimates under the Hui and Walter model can be
computed using the Gauss-Newton algorithm from the SAS
Nonlinear Regression (NLIN) procedure. With this
approach, one can express the observed frequencies, $n_{gij}$, in
terms of the total sample size, $n_{g\cdot\cdot}$, multiplied by the
probabilities in expression (A.1). Hui and Walter also
present the closed-form estimators given in (A.2),
expressed in terms of the observed cell probabilities denoted


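by $p_{gij}$:

$$\hat{\alpha}_1 = \frac{p_{21\cdot}\,p_{1\cdot 1} - p_{11\cdot}\,p_{2\cdot 1} + p_{111} - p_{211} + D}{2E_1}, \qquad
\hat{\alpha}_2 = \frac{p_{11\cdot}\,p_{2\cdot 1} - p_{21\cdot}\,p_{1\cdot 1} + p_{111} - p_{211} + D}{2E_2},$$

$$\hat{\beta}_1 = \frac{p_{22\cdot}\,p_{1\cdot 1} - p_{12\cdot}\,p_{2\cdot 1} + p_{121} - p_{221} + D}{2E_1}, \qquad
\hat{\beta}_2 = \frac{p_{2\cdot 2}\,p_{11\cdot} - p_{1\cdot 2}\,p_{21\cdot} + p_{112} - p_{212} + D}{2E_2},
\qquad (A.2)$$

where,

$$p_{gi\cdot} = \sum_{j=1}^{2} p_{gij}, \qquad p_{g\cdot j} = \sum_{i=1}^{2} p_{gij},$$

$$\hat{\pi}_g = \frac{1}{2} - \frac{p_{g1\cdot}(p_{1\cdot 1} - p_{2\cdot 1}) + p_{g\cdot 1}(p_{11\cdot} - p_{21\cdot}) + p_{211} - p_{111}}{2D},$$

$$D = \pm\left[(p_{11\cdot}\,p_{2\cdot 1} - p_{21\cdot}\,p_{1\cdot 1} + p_{111} - p_{211})^2
      - 4\,(p_{11\cdot} - p_{21\cdot})(p_{111}\,p_{2\cdot 1} - p_{211}\,p_{1\cdot 1})\right]^{1/2},$$

with,

$$E_1 = p_{1\cdot 1} - p_{2\cdot 1}, \qquad E_2 = p_{11\cdot} - p_{21\cdot}.$$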


Note that two distinct points exist in the solution set, for
either a positive or a negative value of $D$; however, only one
of the values will yield reasonable estimates. Variances for
the estimators, derived from the estimated asymptotic
information matrix, are given in Hui and Walter's (1980)
paper.
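
The cell probabilities in (A.1) and a closed-form solution of the two-population system can be checked against each other numerically. The sketch below is an illustration rather than the authors' SAS code. It builds exact cell probabilities from chosen parameter values and then recovers them; the parameter values are hypothetical, the function names are not from the paper, and the root of $D$ is selected under the added assumption that each test satisfies $\alpha_r + \beta_r < 1$.

```python
import math


def cell_probs(pi, a1, b1, a2, b2):
    """Cell probabilities of (A.1), keyed by (test 1 outcome, test 2 outcome),
    with 1 = positive and 2 = negative. a = false positive rate,
    b = false negative rate."""
    return {
        (1, 1): pi * (1 - b1) * (1 - b2) + (1 - pi) * a1 * a2,
        (2, 1): pi * b1 * (1 - b2) + (1 - pi) * (1 - a1) * a2,
        (1, 2): pi * (1 - b1) * b2 + (1 - pi) * a1 * (1 - a2),
        (2, 2): pi * b1 * b2 + (1 - pi) * (1 - a1) * (1 - a2),
    }


def hui_walter(p1, p2):
    """Closed-form solution from the cell proportions of two subpopulations."""
    # Marginals: x_g = P(test 1 positive), y_g = P(test 2 positive)
    x1, x2 = p1[1, 1] + p1[1, 2], p2[1, 1] + p2[1, 2]
    y1, y2 = p1[1, 1] + p1[2, 1], p2[1, 1] + p2[2, 1]
    z1, z2 = p1[1, 1], p2[1, 1]
    e1, e2 = y1 - y2, x1 - x2
    disc = (x1 * y2 - x2 * y1 + z1 - z2) ** 2 - 4 * e2 * (z1 * y2 - z2 * y1)
    # Assumption: informative tests (alpha + beta < 1), which fixes the sign
    d = -math.copysign(math.sqrt(disc), e1)
    alpha1 = (x2 * y1 - x1 * y2 + z1 - z2 + d) / (2 * e1)
    alpha2 = (x1 * y2 - x2 * y1 + z1 - z2 + d) / (2 * e2)
    beta1 = ((1 - x2) * y1 - (1 - x1) * y2 + p1[2, 1] - p2[2, 1] + d) / (2 * e1)
    beta2 = ((1 - y2) * x1 - (1 - y1) * x2 + p1[1, 2] - p2[1, 2] + d) / (2 * e2)
    pis = [0.5 - (x * e1 + y * e2 + z2 - z1) / (2 * d)
           for x, y in ((x1, y1), (x2, y2))]
    return {"pi1": pis[0], "pi2": pis[1], "alpha1": alpha1, "beta1": beta1,
            "alpha2": alpha2, "beta2": beta2}


# Hypothetical parameter values used only to exercise the formulas
truth = {"pi1": 0.30, "pi2": 0.10, "alpha1": 0.02, "beta1": 0.15,
         "alpha2": 0.03, "beta2": 0.10}
table1 = cell_probs(truth["pi1"], truth["alpha1"], truth["beta1"],
                    truth["alpha2"], truth["beta2"])
table2 = cell_probs(truth["pi2"], truth["alpha1"], truth["beta1"],
                    truth["alpha2"], truth["beta2"])
estimates = hui_walter(table1, table2)
for name in truth:
    print(f"{name}: true {truth[name]:.4f}, recovered {estimates[name]:.4f}")
```

Because the inputs are exact probabilities rather than sample proportions, the recovered values agree with the chosen parameters up to floating-point error; with real data the same formulas return estimates subject to sampling variability.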


TECHNICAL APPENDIX B 


Adjusting the Reported Unemployment Rates 


To evaluate the implications of the estimated error rates,
we needed an expression for estimating the actual
prevalence rates (the four $\pi$ parameters) in terms of the
estimated error rates and the observed prevalence rates (or
sample frequencies) from a given survey. In this section, we
present the formula for these computations. With this
expression, we can use the BLS reported unemployed and
employed prevalence rates as the observed values to
estimate the adjusted BLS prevalence rates. Such an
expression is given in (B.1).

Note that in expression (B.1), we have deleted the $g$-th
subscript from the $\pi$ parameters, so that the expression
represents the prevalence rates among the general
population, males and females combined. Note that, in this study,
we have assumed that the estimated error rates are equal for
males and females.


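$$\begin{bmatrix} \hat{\pi}_{y1} \\ \hat{\pi}_{y2} \end{bmatrix}
= \begin{bmatrix}
1-\hat{\beta}_{121}-\hat{\beta}_{131}-\hat{\beta}_{113} & \hat{\beta}_{112}-\hat{\beta}_{113} \\
\hat{\beta}_{121}-\hat{\beta}_{123} & 1-\hat{\beta}_{112}-\hat{\beta}_{132}-\hat{\beta}_{123}
\end{bmatrix}^{-1}
\begin{bmatrix}
\dfrac{n_{y1\cdot}}{n_{y\cdot\cdot}} - \hat{\beta}_{113} \\[2ex]
\dfrac{n_{y2\cdot}}{n_{y\cdot\cdot}} - \hat{\beta}_{123}
\end{bmatrix}
\qquad (B.1)$$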


In this paper, we have three sets of observed values: two
sets of observed prevalence rates from the reinterview
sample (which is a sub-sample of the full CPS sample),
namely the unreconciled reinterview sample data and the
reconciled reinterview data from the response-bias study
sample, and the BLS reported prevalence rates, as observed
from the full CPS original survey. We will concentrate our
efforts on the first and last of these three sets of statistics, the
unreconciled reinterview sample data and the published
BLS estimates. To keep these two sets separate, we will
define,

$$U_y^{R} = \frac{n_{y1\cdot}}{n_{y\cdot\cdot}}, \qquad
E_y^{R} = \frac{n_{y2\cdot}}{n_{y\cdot\cdot}}, \qquad (B.2)$$

as the observed unemployed and employed prevalence rates
obtained from the CPS unreconciled reinterview sample
data. The corresponding BLS reported prevalence rates,
based on the full CPS original survey weighted data, are
denoted by $U_y^{BLS}$ and $E_y^{BLS}$.

Similarly, the observed unemployment rate among those
in the labour force, from the unreconciled reinterview
sample data, is denoted by $UE_y^{R}$, equal to $U_y^{R}$ divided by
$(U_y^{R} + E_y^{R})$, and the observed BLS reported unemployment
rate is denoted by $UE_y^{BLS}$.

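Simplifying expression (B.1) in terms of the observed
reinterview prevalence rates, $U_y^{R}$ and $E_y^{R}$, we find:

$$\hat{\pi}_{y1} = \frac{(1-\hat{\beta}_{112}-\hat{\beta}_{132}-\hat{\beta}_{123})(U_y^{R}-\hat{\beta}_{113})
                  - (\hat{\beta}_{112}-\hat{\beta}_{113})(E_y^{R}-\hat{\beta}_{123})}{\Delta},$$

$$\hat{\pi}_{y2} = \frac{(1-\hat{\beta}_{121}-\hat{\beta}_{131}-\hat{\beta}_{113})(E_y^{R}-\hat{\beta}_{123})
                  - (\hat{\beta}_{121}-\hat{\beta}_{123})(U_y^{R}-\hat{\beta}_{113})}{\Delta},
\qquad (B.3)$$

$$\Delta = (1-\hat{\beta}_{121}-\hat{\beta}_{131}-\hat{\beta}_{113})(1-\hat{\beta}_{112}-\hat{\beta}_{132}-\hat{\beta}_{123})
         - (\hat{\beta}_{112}-\hat{\beta}_{113})(\hat{\beta}_{121}-\hat{\beta}_{123}).$$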


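Using expression (B.3), we can compute estimates of the
adjusted unemployment rate among those in the labour force
from the reinterview survey, denoted by $AUE_y^{R}$, equal to $\hat{\pi}_{y1}$
divided by $(\hat{\pi}_{y1} + \hat{\pi}_{y2})$. Note that $AUE_y^{R}$ can be expressed
as follows:

$$AUE_y^{R} = \frac{-U_y^{R} + \hat{\beta}_{113} + \hat{\beta}_{112}(U_y^{R} - \hat{\beta}_{113} + E_y^{R})
              + \hat{\beta}_{132}(U_y^{R} - \hat{\beta}_{113}) + \hat{\beta}_{123}(U_y^{R} - \hat{\beta}_{112}) - \hat{\beta}_{113}E_y^{R}}
             {-U_y^{R} + \hat{\beta}_{113}(1 - \hat{\beta}_{112} - \hat{\beta}_{121} - \hat{\beta}_{132})
              + \hat{\beta}_{112}(U_y^{R} + E_y^{R} - \hat{\beta}_{123}) + \hat{\beta}_{132}U_y^{R}
              + \hat{\beta}_{121}(U_y^{R} + E_y^{R} - \hat{\beta}_{123}) - E_y^{R} + \hat{\beta}_{123}
              + \hat{\beta}_{131}(E_y^{R} - \hat{\beta}_{123})}
\qquad (B.4)$$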


Finally, to obtain the adjusted estimate of the BLS
unemployment rate, denoted by $AUE_y^{BLS}$, we substitute the
values of $U_y^{BLS}$ for $U_y^{R}$ and $E_y^{BLS}$ for $E_y^{R}$ into expression
(B.4). Note that the estimated standard errors of the
estimates for $AUE_y^{BLS}$, presented in section four, were
computed using a Taylor series approximation method
(Wolter 1985). As a first step in this process, we assumed
the variances of the published estimates of $U_y^{BLS}$ and $E_y^{BLS}$
were negligible. While this is not strictly true, this assumption
greatly simplifies the computation of the variances, and
captures the majority of the total variation. This assumption
is supported by the fact that the variance of these
estimates, given the large full CPS yearly sample sizes, is
negligible in comparison to the sampling error associated
with the error rate estimates, which are based on the small
unreconciled reinterview sample sizes. In summary, once
the substitution of $U_y^{BLS}$ for $U_y^{R}$, and $E_y^{BLS}$ for $E_y^{R}$, into
expression (B.4) is completed, we assume that $U_y^{BLS}$ and $E_y^{BLS}$
are fixed known values in this equation. Finally, the
sampling variance associated with the difference between
the adjusted value and the published value, which defines the
bias in the original estimate, is computed from the sum of the
variances. Hence, by assuming the published value is
sampling variance-free, the sampling variability associated
with the difference or bias is simply equal to the sampling
variability associated with the adjusted value.
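
The adjustment described in this appendix amounts to solving the 2 x 2 linear system in (B.1) and taking a ratio. The sketch below is a minimal illustration, not the authors' program. The error rates are the "Our Method" values from Table 1, the 1989 inputs are the published unemployed and employed proportions quoted in Technical Appendix C (.0347 and .6329), and the function names are hypothetical.

```python
# Error rates from Table 1 ("Our Method"); keys are the subscripts r, i, j.
BETA = {"121": 0.0407, "131": 0.1196, "112": 0.0049,
        "132": 0.0100, "113": 0.0110, "123": 0.0205}


def adjust_prevalences(u_obs, e_obs, b):
    """Solve (B.1) by Cramer's rule for the true unemployed and employed
    proportions, given the observed proportions and the error rates."""
    a11 = 1 - b["121"] - b["131"] - b["113"]
    a12 = b["112"] - b["113"]
    a21 = b["121"] - b["123"]
    a22 = 1 - b["112"] - b["132"] - b["123"]
    det = a11 * a22 - a12 * a21
    rhs_u = u_obs - b["113"]
    rhs_e = e_obs - b["123"]
    pi_u = (a22 * rhs_u - a12 * rhs_e) / det
    pi_e = (a11 * rhs_e - a21 * rhs_u) / det
    return pi_u, pi_e


def adjusted_unemployment_rate(u_obs, e_obs, b):
    """Adjusted unemployment rate among those in the labour force."""
    pi_u, pi_e = adjust_prevalences(u_obs, e_obs, b)
    return pi_u / (pi_u + pi_e)


# 1989 published proportions quoted in Technical Appendix C
u_1989, e_1989 = 0.0347, 0.6329
reported = u_1989 / (u_1989 + e_1989)
adjusted = adjusted_unemployment_rate(u_1989, e_1989, BETA)
print(f"reported rate: {reported:.4f}")
print(f"adjusted rate: {adjusted:.4f}")
```

Because the inputs are rounded to four decimal places, the result should be read as agreeing with the 1989 row of Table 4 only to within rounding.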


TECHNICAL APPENDIX C 


Estimating Standard Errors of the Adjusted 
Unemployment Rates 


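For a complex function of several estimated parameters,
the estimates of the variances associated with this function
can be computed using a Taylor series approximation, as
discussed by Wolter (1985). Suppose that the population
parameter of interest is $Y = G(\Theta)$, where $\Theta$ represents a
$p$-dimensional vector of population parameters, $\Theta =
(\theta_1, \ldots, \theta_p)$. If $G$ possesses continuous second derivatives in
an admissible range for $\Theta$ and $\hat{\Theta}$, then Wolter (1985)
presents the relationship:

$$\hat{Y} - Y = \sum_{k=1}^{p} \frac{\partial G(\Theta)}{\partial \theta_k}(\hat{\theta}_k - \theta_k) + R(\hat{\Theta}, \Theta)$$

where,

$$R(\hat{\Theta}, \Theta) = \sum_{k=1}^{p}\sum_{l=1}^{p} \frac{1}{2!}\,
\frac{\partial^2 G(\ddot{\Theta})}{\partial \theta_k\,\partial \theta_l}
(\hat{\theta}_k - \theta_k)(\hat{\theta}_l - \theta_l),
\qquad (C.1)$$

and $\ddot{\Theta}$ lies between $\hat{\Theta}$ and $\Theta$.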


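The remainder term is often regarded as of little
consequence, and is eliminated from the relationship. Given the
first order approximation, Wolter (1985) presents,

$$\begin{aligned}
\mathrm{MSE}(\hat{Y}) &= E\left[G(\hat{\Theta}) - G(\Theta)\right]^2 \\
&\approx \mathrm{Var}\left(\sum_{k=1}^{p} \frac{\partial G(\Theta)}{\partial \theta_k}(\hat{\theta}_k - \theta_k)\right) \\
&= \sum_{k=1}^{p}\sum_{l=1}^{p} \frac{\partial G(\Theta)}{\partial \theta_k}\,
   \frac{\partial G(\Theta)}{\partial \theta_l}\,\mathrm{Cov}(\hat{\theta}_k, \hat{\theta}_l) \\
&= \mathbf{d}\,\Sigma\,\mathbf{d}^{T}
\end{aligned}
\qquad (C.2)$$

where $\mathbf{d}$ is a row vector of dimension $p$ with the elements,

$$d_k = \frac{\partial G(\Theta)}{\partial \theta_k}. \qquad (C.3)$$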

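Wolter calls this estimator the first order approximation
to the mean square error (equal to the sampling variance plus
the squared bias of the estimator). Higher order
approximations can be developed by retaining additional terms in
the expansion. For purposes of variance estimation, we
substitute the estimated covariance matrix for $\Sigma$, and
evaluate $\mathbf{d}$ at the estimated values of $\Theta$. Specifically, in our
problem, we wish to estimate the variance associated with
the function of the estimates in expression (C.4), given
below.

$$G(\hat{\Theta}) = G(\hat{\beta}_{121}, \hat{\beta}_{131}, \hat{\beta}_{112}, \hat{\beta}_{132}, \hat{\beta}_{113}, \hat{\beta}_{123}, U_y^{BLS}, E_y^{BLS})$$

$$= \frac{-U_y^{BLS} + \hat{\beta}_{113} + \hat{\beta}_{112}(U_y^{BLS} - \hat{\beta}_{113} + E_y^{BLS})
        + \hat{\beta}_{132}(U_y^{BLS} - \hat{\beta}_{113}) + \hat{\beta}_{123}(U_y^{BLS} - \hat{\beta}_{112}) - \hat{\beta}_{113}E_y^{BLS}}
       {-U_y^{BLS} + \hat{\beta}_{113}(1 - \hat{\beta}_{112} - \hat{\beta}_{121} - \hat{\beta}_{132})
        + \hat{\beta}_{112}(U_y^{BLS} + E_y^{BLS} - \hat{\beta}_{123}) + \hat{\beta}_{132}U_y^{BLS}
        + \hat{\beta}_{121}(U_y^{BLS} + E_y^{BLS} - \hat{\beta}_{123}) - E_y^{BLS} + \hat{\beta}_{123}
        + \hat{\beta}_{131}(E_y^{BLS} - \hat{\beta}_{123})}
\qquad (C.4)$$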


To create the estimates, we have assumed that the values
of $U_y^{BLS}$ and $E_y^{BLS}$ are fixed (i.e., have a negligible
sampling variance). Taking the partial derivatives of
equation (C.4) with respect to the six error rates, and
evaluating these expressions at the estimated values of the
error rates, yields a vector $\mathbf{d}$ which depends on the values of
the error rate estimates and the published BLS unemployed
and employed proportions for each year of the study. With
our original model, which assumes the error rates are fixed
across years, this $\mathbf{d}$ vector varies from year to year only
through the published values. For
illustrative purposes, the estimated vector $\hat{\mathbf{d}}$ for 1989, using
the BLS published unemployed and employed prevalence
rates of .0347 and .6329, is equal to:

$\hat{\mathbf{d}}$ =
    $\hat{\beta}_{121}$     0.07851
    $\hat{\beta}_{131}$     0.07558
    $\hat{\beta}_{112}$    -1.2918
    $\hat{\beta}_{132}$    -0.04813
    $\hat{\beta}_{113}$    -0.64214
    $\hat{\beta}_{123}$     0.03884


The estimated covariance matrix from our SAS NLIN
analysis, which is based on the original model that assumes
the error rates are fixed by year and, as such, is the same for
all years under study, is given below.

                $\beta_{121}$   $\beta_{131}$   $\beta_{112}$   $\beta_{132}$   $\beta_{113}$   $\beta_{123}$
$\beta_{121}$    0.000358   -4.7E-05   -3.5E-07   -2.6E-08   -3.9E-07    2.9E-07
$\beta_{131}$   -4.7E-05     0.000214  -1.7E-07   -5.2E-07   -1.4E-06   -2.8E-07
$\beta_{112}$   -3.5E-07    -1.7E-07    1.54E-06   2.14E-07  -2.3E-08    9.9E-10
$\beta_{132}$   -2.6E-08    -5.2E-07    2.14E-07   2.37E-06  -1.5E-08   -6.1E-08
$\beta_{113}$   -3.9E-07    -1.4E-06   -2.3E-08   -1.5E-08    2.4E-06   -8.0E-08
$\beta_{123}$    2.9E-07    -2.8E-07    9.9E-10   -6.1E-08   -8.0E-08    6.1E-06


Pre- and post-multiplying the estimated covariance matrix
by the vector $\mathbf{d}$ yields an estimated variance for $AUE_y^{BLS}$
for 1989 of 6.72E-6 and a standard error of the estimate
equal to .0026 (.26%), as given in Table 4.
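
The quadratic form in (C.2) can be evaluated directly from the quantities printed in this appendix. The sketch below is an illustration only. The covariance matrix is transcribed from the table above. In the derivative vector, the entries for $\beta_{131}$ and $\beta_{112}$ are only partly legible in this copy, so the values 0.07558 and -1.2918 are assumptions, chosen because they are consistent with the legible digits and with the reported variance. The function name `delta_method_variance` is hypothetical.

```python
import math

# Derivative vector for 1989, ordered as beta_121, beta_131, beta_112,
# beta_132, beta_113, beta_123.
# Assumption: the second and third entries are reconstructed values.
d = [0.07851, 0.07558, -1.2918, -0.04813, -0.64214, 0.03884]

# Estimated covariance matrix of the six error rate estimates
cov = [
    [0.000358, -4.7e-05, -3.5e-07, -2.6e-08, -3.9e-07, 2.9e-07],
    [-4.7e-05, 0.000214, -1.7e-07, -5.2e-07, -1.4e-06, -2.8e-07],
    [-3.5e-07, -1.7e-07, 1.54e-06, 2.14e-07, -2.3e-08, 9.9e-10],
    [-2.6e-08, -5.2e-07, 2.14e-07, 2.37e-06, -1.5e-08, -6.1e-08],
    [-3.9e-07, -1.4e-06, -2.3e-08, -1.5e-08, 2.4e-06, -8.0e-08],
    [2.9e-07, -2.8e-07, 9.9e-10, -6.1e-08, -8.0e-08, 6.1e-06],
]


def delta_method_variance(gradient, covariance):
    """First-order (Taylor series) variance: gradient * covariance * gradient'."""
    size = len(gradient)
    return sum(gradient[k] * covariance[k][l] * gradient[l]
               for k in range(size) for l in range(size))


variance = delta_method_variance(d, cov)
standard_error = math.sqrt(variance)
print(f"variance: {variance:.3e}")
print(f"standard error: {standard_error:.4f}")
```

The same function can be reused for any other year by substituting that year's derivative vector, since under the equal error rate model the covariance matrix does not change from year to year.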
Use of Statistical Matching Techniques
in Calibration Estimation


ROBBERT H. RENSSEN¹


ABSTRACT 


This article deals with an attempt to cross-tabulate two categorical variables, which were separately collected from two large 
independent samples, and jointly collected from one small sample. It was assumed that the large samples have a large set 
of common variables. The proposed estimation technique can be considered a mix between calibration techniques and 
statistical matching. Through calibration techniques, it is possible to incorporate the complex designs of the samples in the 
estimation procedure, to fulfill some consistency requirements between estimates from various sources, and to obtain fairly 
unbiased estimates for the two-way table. Through the statistical matching techniques, it is possible to incorporate a 
relatively large set of common variables in the calibration estimation, by means of which the precision of the estimated 
two-way table can be improved. The estimation technique enables us to gain insight into the bias generally obtained, in 
estimating the two-way table, by sole use of the large samples. It is shown how the estimation technique can be useful to 
impute values of the one large sample (donor source) into the other large sample (host source). Although the technique is 
principally developed for categorical variables Y and Z, with a minor modification, it is also applicable for continuous 
variables Y and Z. 


KEY WORDS: Consistency between estimates; General regression estimator; Imputation; Multivariate auxiliary 
information; Two-way table. 


1. INTRODUCTION 


Most statistical surveys are conducted to obtain estimates 
of simple descriptive finite population parameters. The 
estimates are often presented in tabular form, with cells 
containing estimates of population totals or subgroup totals. 
Often, data are collected on an extensive set of variables, 
producing numerous results for these variables and their 
relationships. In order to save resources and decrease 
response burden, statistical bureaus wish to reduce sample 
sizes and shorten questionnaires. They resort to administrative 
data sources and existing large-scale sample surveys, 
or apply split questionnaire survey designs (see 
Raghunathan and Grizzle 1995). As a consequence, methods 
for combining distinct data sources have become a 
popular tool in the production of statistics. Combining data 
sources can be done in many different ways; two well- 
known techniques in survey sampling are statistical 
matching and calibration estimation. 

Singh, Mantel, Kinack and Rowe (1993) describe statis- 
tical matching as a special case of imputation in which there 
are two distinct micro-data sources containing different 
information on different units. One data source serves as a 
host or recipient file to which new information is imputed 
for each record, using data from the other source, which is 
the donor file. More specifically, they consider a host file 
A, containing information on variables (X, Y) and a donor 
file B containing information on variables (X,Z). The 
common variable X can be used to identify similar units in 
the two files. In general, statistical matching deals with the 


problem of completing the records in file A, by imputing 
values for Z using the information on the (X, Z) relationship 
in file B. These imputed Z-values suffer from a serious 
limitation in that the real relationship between Y and Z may 
be completely lost in the enriched host file. This limitation 
amounts to the so-called assumption of conditional independence 
between Y and Z given X. In order to get rid of 
this conditional independence assumption, Singh et al. 
(1993) consider a third data set (file C) representing 
auxiliary information about the full set (X, Y,Z). For 
example, this data set could come from a small-scale 
specially conducted survey. They discuss several imputa- 
tion methods to complete file A, by adding Z from file B 
using information from A, B, and C, on the joint relation- 
ships of X, Y, and Z. Singh et al. (1993) give many relevant 
references on statistical matching techniques. We only 
mention Rodgers (1984), Rubin (1986) and Paass (1986). 

In Deville and Sarndal (1992), calibration estimation is 
derived as a general technique to weight sample surveys, 
taking into account the complex design of the sample and 
auxiliary information obtained from external sources (see 
also Deville, Sarndal, and Sautory 1993). The use of 
auxiliary information, i.e., control variables, primarily aims 
at three goals, namely, reducing sampling variance, 
reducing bias due to non-response, and ensuring 
consistency between estimates from various sources with 
respect to the used control variables. There is an extensive 
body of literature on weighting methods in sample surveys. 
We refer to Bethlehem and Keller (1987), Alexander 
(1987), Lemaitre and Dufour (1987), and Zieschang (1990). 


¹ Robbert H. Renssen, Department of Statistical Methods, Statistics Netherlands, P.O. Box 4481, 6401 CZ Heerlen, Netherlands. 
This article deals with the specific problem of how to 
estimate the cross-product between Y and Z (e.g., the 
two-way table between Y and Z in case these variables are 
categorical or the covariance between Y and Z in case these 
variables are continuous), using statistical matching tech- 
niques as well as calibration estimation. We assume that 
two data files A and B represent two large-scale sample 
surveys, possibly both obtained by a complex design. In 
order to weight the specially conducted small sample 
(file C), auxiliary information is derived from these large 
samples. It might be difficult to judge whether the large 
samples should be considered as suppliers of auxiliary 
information for the small sample, or vice versa. Through the 
statistical matching, it is possible to incorporate a large set 
of X-variables in the estimation procedure, despite the 
sample size of the small sample. The use of calibration 
estimation makes it possible to take account of the complex 
design of all samples in the estimation procedure, and to 
fulfill some consistency requirements. Most of the article is 
devoted to categorical Y and Z, because of the specific 
properties of these variables. For example, it is shown that 
the marginal counts of the estimated YZ-table, always 
coincide with estimates for the population totals of Y and Z, 
when the ordinary calibration estimator is applied with the 
X-variables as control variables, on the first and second 
large sample respectively. Nevertheless, the proposed 
method is also applicable for continuous Y and Z. 
Throughout this article it will be assumed that X may 
consist of several variables, which may be categorical 
and/or continuous. It is argued that when the X-variables are 
highly correlated with either Y or Z, then our estimation 
method gives relatively precise estimates for the cross-product 
between Y and Z, e.g., for the complete YZ-table 
when Y and Z are categorical. 

The proposed estimation procedure closely resembles a 
method presented in Singh et al. (1993, Section 2) to 
estimate a correlation coefficient between Y and Z. These 
variables are assumed to be univariate in this article. Our 
method, however, differs from theirs in that it incorporates 
the complex designs of all data sources in the estimation 
procedure and that it uses the large data sources more 
efficiently in estimating population parameters from the 
small data source. When Y and Z are categorical, and there 
is no linear correlation between X and Y as well as between 
X and Z, then our method corresponds to incomplete 
post-stratification (Deville and Sarndal 1992, Bethlehem 
and Keller 1987). On the other hand, if Y is perfectly 
correlated with X, then our method gives an estimated 
two-way table between Y and Z which corresponds to an 
estimated two-way table that would have been obtained 
from file B if first the Y-values were imputed. A similar 
result holds if Z and X are perfectly correlated. 

Although combining distinct data sources across 
common variables may be fruitful from a theoretical point 
of view, in practice, complications may arise because 
common variables in the strict sense are not easily found, 
mainly due to discrepancies between definitions, methods 
of observation, and reference periods. These complications 
may be reduced if the survey processes involved are 
harmonized at an early stage. A promising application of 
the use of common variables lies in integrated survey 
designs, such as the Dutch Household Survey on Living 
Conditions, see van Tuinen (1995), Bakker and Winkels 
(1998), Winkels and Everaers (1998), and Hofmans (1998). 
The questionnaire design of this survey has a three-shell 
structure. The first shell contains questions on demographic 
and socioeconomic issues, and level of education. The 
second shell contains a few easy to answer core questions, 
on every relevant aspect of living conditions. The questions 
in the third shell also concern living conditions, but they are 
more exhaustive than the questions in the second shell. In 
order to shorten the time it takes to answer, the third shell 
questionnaire is split. Each respondent has to fill in the 
complete questionnaire of the first and second shell and one 
sub-questionnaire of the third shell. On account of the third 
shell, the sample is split into sub-samples associated with 
each sub-questionnaire. The sampling design of each 
sub-sample can be described as two-phase sampling for the 
general regression estimator. 

The organization of this article is as follows. The 
theoretical framework is developed in Section 2. For this 
purpose it is convenient to discuss a calibration estimator 
for the small sample, obtaining auxiliary information from 
two distinct registrations instead of two distinct large 
samples. One registration contains values on X and Y and 
the other registration on X and Z. Sections 2.1 to 2.4 deal 
with categorical Y- and Z-variables. In Section 2.1, the 
registrations are used to obtain a first synthetic estimate of 
the YZ-table by regression methods of imputation. It is 
shown that this synthetic two-way table has some inter- 
esting properties. In Section 2.2 we propose a set of 
calibration equations to weight the small sample, based on 
these properties. We briefly discuss its relationship to 
complete and incomplete post-stratification. A numerical 
illustration is given in Section 2.3. The linkage to statistical 
matching techniques as discussed in Singh et al. (1993) is 
given in Section 2.4. The treatment of categorical Y and Z 
is unnecessarily restrictive. In Section 2.5, it is shown 
that the proposed weighting technique is also applicable for 
continuous Y and Z or for continuous Y and categorical Z. 
In Section 3, the technique is modified, using auxiliary 
information from two distinct large samples instead of two 
registrations. By means of a simulation study, the modified 
weighting method is compared to the traditional incomplete 
two-way stratification. Finally, Section 4 contains some 
concluding remarks. 


2. COMBINING REGISTRATIONS ACROSS 
COMMON VARIABLES 


Consider a finite population Q = {1, ..., N} of N persons 
and suppose there are two registrations available of these 


Survey Methodology, December 1998 


persons. The first registration contains, for each person k, a 
record with scores $y_k$ and $x_k$ on the variables Y and X 
respectively, and the second registration contains, for each 
person k, a record with scores $z_k$ and $x_k$ on the variables Z and X 
respectively, k = 1, ..., N. Obviously, the variable X is 
present in both registrations. We note that the records from 
both registrations correspond to the same finite population. 
The process of merging these registrations would be like 
exact matching if X were used to compare the records in 
one registration with those in the other, in an 
effort to determine which pairs of records relate to the same 
population unit (see Fellegi and Sunter 1969). In this article 
we will proceed differently. 


2.1 Formulating the Synthetic Population Totals 


Let Y denote education with p categories and Z denote 
employment with q categories. Then $y_k$ is a vector of order 
p, representing p dummy variables. Each dummy variable 
corresponds to a specific category; it equals 1 if person k 
belongs to that category, otherwise it equals 0. Analogously 
defined, $z_k$ is a vector of order q. Further, X may be the 
result of a complete or incomplete crossing (stratification) 
of a number of characteristics (e.g., sex, age, region, marital 
status, etc.). The scores $x_k$ are vector valued, of order r. In 
case X consists of a complete stratification, $x_k$ represents r 
dummy variables. In the remainder of this article, r should 
be considered large in comparison with p × q. The population 
totals for Y and Z are the marginal frequency distributions 
with respect to education and employment. Using 
the common variable X, predictions for Y and Z can be 
defined with a multiple linear regression model: 


$$\hat{y}_k = B^t x_k, \qquad k = 1, \ldots, N,$$
and 
$$\hat{z}_k = A^t x_k, \qquad k = 1, \ldots, N,$$


where B and A are the ordinary least squares regression 
coefficients satisfying the normal equations 


$$\sum_{k=1}^{N} x_k x_k^t B = \sum_{k=1}^{N} x_k y_k^t \qquad (1)$$


and 

$$\sum_{k=1}^{N} x_k x_k^t A = \sum_{k=1}^{N} x_k z_k^t. \qquad (2)$$


The superscript 't' denotes transposition. This model is 
called a linear probability model (see Maddala 1983, 
chap. 2). There are more elegant models, such as probit and 
logit models, to predict binary variables. However, we are 
not interested in the predictions themselves, but in the 
synthetic population totals of these predictions. These totals 
appear to have nice properties if the linear prediction model 
is used, and for this reason the model can be justified. Note 
that B is calculated from the first registration and A from the 
second one. By means of the common variable X and the 
regression coefficients B and A, we construct a synthetic 
registration, which contains a record for each person k with 
scores $x_k$, $B^t x_k$, and $A^t x_k$. In fact, either $y_k$ or $z_k$ may be 
added to this registration, but for our purposes this addition 
appears to be superfluous (see next paragraph). If there 
exists a vector a of order r of fixed numbers such that 
$a^t x_k = 1$ for all k, then the population totals of the new 
variables $B^t x_k$ and $A^t x_k$ equal the population totals of the 
corresponding original variables (see e.g., Bethlehem and 
Keller 1987). This can be shown easily by first pre-multiplying 
the normal equations (1) and (2) by $a^t$ and 
subsequently substituting $a^t x_k = 1$ into the resulting 
equations. 
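
Written out for the Y-variable, the argument is a short calculation. The display below is a sketch that assumes only the normal equations for B, in the form $\sum_k x_k x_k^t B = \sum_k x_k y_k^t$, and the condition $a^t x_k = 1$ for all k; the same steps applied to the normal equations for A give the result for Z.

```latex
\begin{aligned}
a^t \sum_{k=1}^{N} x_k x_k^t B &= a^t \sum_{k=1}^{N} x_k y_k^t
  && \text{(pre-multiply by } a^t\text{)} \\
\sum_{k=1}^{N} (a^t x_k)\, x_k^t B &= \sum_{k=1}^{N} (a^t x_k)\, y_k^t \\
\sum_{k=1}^{N} (B^t x_k)^t &= \sum_{k=1}^{N} y_k^t
  && \text{(substitute } a^t x_k = 1\text{)}
\end{aligned}
```

Transposing the last line shows that the total of the predictions $B^t x_k$ equals the total of the $y_k$.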

From the synthetic registration, a synthetic two-way table 
can be defined by $\sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t$. This synthetic 
two-way table can be considered as an approximation of the 
(simultaneous) frequency distribution $\sum_{k=1}^{N} y_k z_k^t$. Using the 
normal equations (1) and (2), the following identities can be 
derived: 


$$\sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t = \sum_{k=1}^{N} y_k (A^t x_k)^t = \sum_{k=1}^{N} (B^t x_k) z_k^t.$$


Clearly, the crossings between $B^t x_k$ and $A^t x_k$, $y_k$ and 
$A^t x_k$, or $B^t x_k$ and $z_k$, all result in identical synthetic 
two-way tables. Therefore, it suffices to consider only 
$\sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t$, and delete either $y_k$ or $z_k$ in the 
synthetic registration. The difference between the real 
frequency distribution between Y and Z and its synthetic 
“approximation” can be obtained from the following 
decomposition 


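$$\sum_{k=1}^{N} y_k z_k^t = \sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t + \sum_{k=1}^{N} (y_k - B^t x_k)(z_k - A^t x_k)^t. \qquad (3)$$
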
Note the strong resemblance with the ordinary variance 
decomposition in regression analysis (see e.g., Searle 1971). 
If either $B^t x_k = y_k$ or $A^t x_k = z_k$ for all k, then the two-way 
table derived from the synthetic registration equals the real 
simultaneous frequency distribution between Y and Z. 

Let $\iota$ be a vector of appropriate order consisting of ones, 
and note that $\iota^t y_k = 1$ and $\iota^t z_k = 1$ for all k. If there exists 
a constant vector a such that $a^t x_k = 1$ for all k, then we also have 

$$\iota^t \hat{y}_k = \iota^t B^t x_k = 1$$

for all k, and similarly $\iota^t \hat{z}_k = \iota^t A^t x_k = 1$ for all k. It follows 
that 


$$\iota^t \sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t = \sum_{k=1}^{N} (A^t x_k)^t = \sum_{k=1}^{N} z_k^t \qquad (4)$$
and 
$$\sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t \iota = \sum_{k=1}^{N} B^t x_k = \sum_{k=1}^{N} y_k. \qquad (5)$$

So, the row and column totals of the synthetic two-way 
table equal the corresponding marginal population counts 
with respect to Y and Z. 
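
These properties are easy to check numerically. The following is a minimal sketch in Python with NumPy, assuming a small artificial population in which X, Y and Z are each a single categorical variable coded as dummies, so that $a^t x_k = 1$ holds with a a vector of ones. The sizes, the seed and the helper `one_hot` are hypothetical choices for illustration and are not taken from the article.

```python
import numpy as np

def one_hot(codes, n_levels):
    """Dummy-code integer category labels as an (N, n_levels) 0/1 matrix."""
    return np.eye(n_levels)[codes]

rng = np.random.default_rng(0)
N, r, p, q = 500, 8, 3, 2                 # hypothetical population and table sizes
x = one_hot(rng.integers(0, r, N), r)     # rows are x_k'
y = one_hot(rng.integers(0, p, N), p)     # rows are y_k' (first registration)
z = one_hot(rng.integers(0, q, N), q)     # rows are z_k' (second registration)

# Least squares coefficients, i.e. solutions of the two sets of normal equations.
B = np.linalg.lstsq(x, y, rcond=None)[0]  # r x p
A = np.linalg.lstsq(x, z, rcond=None)[0]  # r x q
y_hat, z_hat = x @ B, x @ A               # rows are (B'x_k)' and (A'x_k)'

synthetic = y_hat.T @ z_hat               # sum_k (B'x_k)(A'x_k)'
real = y.T @ z                            # sum_k y_k z_k'
residual = (y - y_hat).T @ (z - z_hat)    # sum_k (y_k - B'x_k)(z_k - A'x_k)'

# The three crossings give the same synthetic table.
print(np.allclose(synthetic, y.T @ z_hat), np.allclose(synthetic, y_hat.T @ z))
# Real table = synthetic table + residual cross-product.
print(np.allclose(real, synthetic + residual))
# Margins of the synthetic table equal the population counts of Y and Z.
print(np.allclose(synthetic.sum(axis=1), y.sum(axis=0)),
      np.allclose(synthetic.sum(axis=0), z.sum(axis=0)))
```

Because Y and Z are drawn independently of X here, the residual term is large; the sketch only illustrates the algebra, not the quality of the approximation.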

What remains to consider is the condition $a^t x_k = 1$ for 
all k, for some constant vector a. This condition is satisfied if X 
represents a categorical variable. More generally, the 
condition is always satisfied if the vector X can be partitioned 
into two sub-vectors, one of which represents a 
categorical variable. 


2.2 Formulating the Constraints in Calibration 
Estimation 


Suppose a probability sample s of size n is drawn from 
the finite population Q = {1, ..., N} according to a sampling 
design p(s) such that the first and second order inclusion 
probabilities $\Pr(k \in s) = \pi_k$ and $\Pr(k, l \in s) = \pi_{kl}$ are strictly 
positive. For each $k \in s$ the vector of scores $(x_k, y_k, z_k)$ is 
observed. Two distinct registrations are available to provide 
auxiliary information. The first registration contains, for 
each $k \in Q$, records with scores on $x_k$ and $y_k$; the second 
registration contains, for each $k \in Q$, scores on $x_k$ and $z_k$. 
The objective is to estimate the YZ-table from the sample s, 
using auxiliary information from both registrations. There 
exists a wide range of weighting type estimators in the 
presence of multivariate auxiliary information. In Sarndal, 
Swensson and Wretman (1992), the general regression 
estimator is extensively discussed. It implicitly defines 
sample weights, which reproduce the known population 
totals of the auxiliary variables, used as control variables in 
the estimator. Such a consistency property is attractive if the 
auxiliary information is used both for publication and for 
weighting. As a generalization of the general regression 
estimator, the calibration estimator is developed (Deville 
and Sarndal 1992 and Deville et al. 1993). 
To be specific, let G be a real valued function as defined 
in Deville et al. (1993) and consider the following 
weighting type estimator for our YZ-table: 


$$\sum_{k=1}^{n} w_k (y_k z_k^t), \qquad (6)$$


where $w_k$ is a scalar, representing a weight assigned to 
person $k \in s$. Denote $d_k = \pi_k^{-1}$. A calibration estimator for 
the YZ-table uses weights which are obtained by minimizing 
$\sum_{k=1}^{n} d_k G(w_k/d_k)$ with respect to $w_k$ subject to a set 
of constraints on $w_k$ for any particular sample s. We first 
consider the following set of constraints: 


$$\sum_{k=1}^{n} w_k y_k = \sum_{k=1}^{N} y_k \quad \text{and} \quad \sum_{k=1}^{n} w_k z_k = \sum_{k=1}^{N} z_k. \qquad (I)$$


This (first) set of constraints only uses the (marginal) counts 
with respect to Y and Z. No use is made of the common 
variable X. One of the p + q equations is redundant, so to 
solve the minimization problem, one equation can be 
deleted. For $G(w_k/d_k) = (w_k/d_k - 1)^2$, the resulting calibration 
estimator corresponds to incomplete two-way stratification 
as defined in Bethlehem and Keller (1987). By 
taking $G(w_k/d_k) = 1 + (w_k/d_k)(\log(w_k/d_k) - 1)$, the classical 
raking ratio estimator is obtained (see e.g., Oh and 
Scheuren 1987). Copeland, Peitzmeier and Hoy (1987) 
have compared these methods, based on data of the Current 
Population Survey. They conclude that the estimates 
produced by the two methods are very similar. In Deville 
et al. (1993), two other distance functions are discussed, 
which are especially interesting in view of the problem of 
extreme weights. Estimating two-way tables with constraints 
on the marginal counts is frequently performed in 
sample surveys. Often, the constraints on the marginal 
counts are required for two reasons. The first reason is to 
reduce sampling error and sampling bias, and the second 
reason is to meet consistency requirements with published 
population counts. 
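
For the quadratic distance function the minimization has a closed-form solution: the weights take the form $w_k = d_k(1 + u_k^t \lambda)$, where $u_k$ collects the control variables and $\lambda$ solves a linear system. The sketch below illustrates this for the first set of constraints. It assumes simple random sampling from an artificial population; the symbol $u_k$, the function name `linear_calibration`, the sizes and the seed are hypothetical and serve only as an illustration. One redundant dummy column is dropped, as described above.

```python
import numpy as np

def linear_calibration(d, u, totals):
    """Weights w_k = d_k (1 + u_k' lam) satisfying sum_k w_k u_k = totals.

    This is the solution for the quadratic distance G = (w/d - 1)^2.
    """
    m = (u * d[:, None]).T @ u                  # sum_k d_k u_k u_k'
    lam = np.linalg.solve(m, totals - d @ u)    # Lagrange multipliers
    return d * (1.0 + u @ lam)

rng = np.random.default_rng(1)
N, n, p, q = 5000, 200, 3, 2                    # hypothetical sizes
y = np.eye(p)[rng.integers(0, p, N)]            # dummy-coded Y for the population
z = np.eye(q)[rng.integers(0, q, N)]            # dummy-coded Z for the population

sample = rng.choice(N, size=n, replace=False)   # simple random sample
d = np.full(n, N / n)                           # design weights 1 / pi_k

# Control variables: all Y dummies and all but one Z dummy (one equation is redundant).
u_pop = np.hstack([y, z[:, :-1]])
w = linear_calibration(d, u_pop[sample], u_pop.sum(axis=0))

# Weighted YZ-table; its margins reproduce the population counts of Y and Z.
table = (y[sample] * w[:, None]).T @ z[sample]
print(np.allclose(table.sum(axis=1), y.sum(axis=0)),
      np.allclose(table.sum(axis=0), z.sum(axis=0)))
```

The dropped Z column is reproduced automatically, because the weights already sum to N through the Y constraints.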

Suppose that $x_k$ is categorical with r categories. Since 
population information about the crossings between Y and 
X, and the crossings between Z and X, is available, we may 
also consider the following set of constraints: 


$$\sum_{k=1}^{n} w_k (y_k x_k^t) = \sum_{k=1}^{N} y_k x_k^t \quad \text{and} \quad \sum_{k=1}^{n} w_k (z_k x_k^t) = \sum_{k=1}^{N} z_k x_k^t.$$

The number of non-redundant constraints in this set equals 
r(p + q - 1). For large r, this set may not be feasible 
because it contains too many constraints in comparison with 
the sample size. Only if r is small may the set be of 
practical interest. In the remainder of this article, this set of 
constraints will be disregarded. 

In view of incorporating a large set of common variables 
in the weighting procedure, we consider a set of constraints, 
which exploits the bivariate population information that we 
have in the synthetic table: 


$$\sum_{k=1}^{n} w_k (B^t x_k)(A^t x_k)^t = \sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t. \qquad (II)$$


This (second) set of constraints is a straightforward 
application of the theory of calibration estimators. 
Population totals of the crossing between $B^t x_k$ and $A^t x_k$ 
are known, so these crossings are taken as auxiliary 
variables to formulate the set of constraints. Evidently, for 
large r, the number of non-redundant constraints remains 
bounded by p × q. A major disadvantage of the resulting 
calibration weights is that they do not necessarily 
reproduce the (marginal) population counts with respect to 
Y and Z when applied to $y_k$ and $z_k$ 
respectively. In other words, the resulting calibration 
weights do not necessarily satisfy the first set of constraints. 
If this set of constraints is formulated in view of 
consistency requirements, this is a serious drawback. 

Therefore, as an alternative, we consider a third set of 
constraints: 


$$\sum_{k=1}^{n} w_k \left\{ y_k z_k^t - (y_k - B^t x_k)(z_k - A^t x_k)^t \right\} = \sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t. \qquad (III)$$


Assuming that there exists a constant vector a such that $a^t x_k = 1$ 
for all k, this set of constraints meets the consistency 
objective. Let $\iota$ denote a vector of ones of appropriate order 
and recall that $\iota^t y_k = \iota^t B^t x_k = \iota^t z_k = \iota^t A^t x_k = 1$ for all k, 
$\sum_{k=1}^{N} B^t x_k = \sum_{k=1}^{N} y_k$, and $\sum_{k=1}^{N} A^t x_k = \sum_{k=1}^{N} z_k$. By pre-multiplying 
the third set of equations on both sides with $\iota^t$, 
we obtain the first set of constraints with respect to Z, and 
post-multiplying the third set on both sides with $\iota$ gives the 
first set of constraints with respect to Y. The resulting 
calibration estimator can be expressed as 


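$$\hat{T} = \sum_{k=1}^{n} w_k (y_k z_k^t) = \sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t + \sum_{k=1}^{n} w_k (y_k - B^t x_k)(z_k - A^t x_k)^t.$$
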
Clearly, this estimator obeys the decomposition given by 
(3). It equals the synthetically defined two-way table plus an 
adjustment term. This adjustment term is a calibration 
estimate for the difference between the real frequency 
distribution between Y and Z and the synthetically defined 
two-way table. Similarly to the second set of constraints, the 
number of non-redundant constraints in the third set is 
bounded by p × q. 
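
The behaviour of the third set of constraints can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: an artificial population with a single categorical X, simple random sampling, and the quadratic distance function, for which the weights have the closed form $w_k = d_k(1 + u_k^t \lambda)$. The names `outer_rows` and `u`, the sizes, the seed and the way Y and Z are made to depend on X are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, r, p, q = 5000, 300, 10, 3, 2             # hypothetical sizes
xc = rng.integers(0, r, N)
# Let Y and Z depend on X so that the regressions carry some information (illustrative).
yc = (xc + rng.integers(0, 2, N)) % p
zc = (xc // 2 + rng.integers(0, 2, N)) % q
x, y, z = np.eye(r)[xc], np.eye(p)[yc], np.eye(q)[zc]

# Regression coefficients from the two registrations, and the residuals.
B = np.linalg.lstsq(x, y, rcond=None)[0]
A = np.linalg.lstsq(x, z, rcond=None)[0]
ey, ez = y - x @ B, z - x @ A                   # y_k - B'x_k and z_k - A'x_k

def outer_rows(a, b):
    """Row-wise vec(a_k b_k') as an (N, p*q) matrix (row-major cell order)."""
    return np.einsum("ki,kj->kij", a, b).reshape(len(a), -1)

# Control variables of the third set: vec(y_k z_k' - (y_k - B'x_k)(z_k - A'x_k)').
u = outer_rows(y, z) - outer_rows(ey, ez)
totals = u.sum(axis=0)                          # vec of the synthetic table

s = rng.choice(N, size=n, replace=False)        # simple random sample
d = np.full(n, N / n)                           # design weights
m = (u[s] * d[:, None]).T @ u[s]
lam = np.linalg.lstsq(m, totals - d @ u[s], rcond=None)[0]
w = d * (1.0 + u[s] @ lam)                      # calibration weights

estimate = (y[s] * w[:, None]).T @ z[s]         # weighted YZ-table
synthetic = (x @ B).T @ (x @ A)                 # synthetic table
adjustment = (ey[s] * w[:, None]).T @ ez[s]     # weighted residual cross-product

print(np.allclose(estimate, synthetic + adjustment))
print(np.allclose(estimate.sum(axis=1), y.sum(axis=0)),
      np.allclose(estimate.sum(axis=0), z.sum(axis=0)))
```

The first check reproduces the expression of the estimator as the synthetic table plus an adjustment term; the second confirms that the margins of the estimated table match the population counts of Y and Z.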

An important special case is $G(w_k/d_k) = (w_k/d_k - 1)^2$. 
Then each estimated cell is a general regression estimate 
with $(y_k, z_k)$, $\mathrm{vec}(B^t x_k x_k^t A)$, and 
$\mathrm{vec}(y_k z_k^t - (y_k - B^t x_k)(z_k - A^t x_k)^t)$ 
as control variables in case of the first, 
second, and third set of constraints respectively. Analytical 
formulas for the design variance of the general regression 
estimator are given in e.g., Sarndal et al. (1992, chap. 6). 
In fact, these formulas are approximations for large sample 
sizes. In Deville and Sarndal (1992), sufficient conditions 
are given under which these approximations are valid for 
calibration estimators in general. 

In Deville et al. (1993), complete post-stratification is 
described as a calibration method for which all population 
counts with respect to the cross-classifications are used in 
the set of constraints. An elaboration of complete post-stratification 
results in the ordinary post-stratification 
estimator, regardless of the distance function G. As an 
alternative, incomplete post-stratification is described as a 
calibration method in which less detailed information than 
complete knowledge of all cell counts is used in the constraint set. 
The calibration estimator defined under the first set of 
constraints is a commonly used example of incomplete 
post-stratification. Several cases are discussed in which 
incomplete post-stratification is preferable to complete 
post-stratification. Two of them are lack of population 
information and some zero or extremely small cell counts 
(see also Oh and Scheuren 1987). The calibration estimator 
defined under the second and third set of constraints 
corresponds to complete post-stratification in the sense that 
all crossings are used as auxiliary information. Except when 
a perfect linear relationship exists either between Y and X, 
or between Z and X, the method differs from complete 
post-stratification in using synthetic population totals 
instead of real population counts. Complete post-stratifi- 
cation gives unstable results, if some sample cells have only 
few observations. In such situations, incomplete post-strati- 
fication is of practical interest. Similarly, the calibration 
estimator under the second and third set of constraints may 
be unstable. Analogously to incomplete post-stratification, 
one might consider using an incomplete crossing in the 
constraints instead. 


2.3 A Numerical Illustration 


We illustrate the calibration estimator under the three 
different sets of constraints by means of a hypothetical 
example. The example is based on real data from a sample 
on behalf of the Dutch National Travel Survey (1994). The 
sampling design is roughly a self-weighted cluster sample 
of addresses. All persons living at a selected address are 
included in the sample. The net sample size is approximately 
80,000 persons within 34,000 addresses. From this 
sample, two hypothetical registrations of approximately 
N = 80,000 persons are constructed. In the one registration, 
age is registered (in six categories), and in the other 
registration, car ownership (in two categories). The 
common variable between the registrations is a key number 
for addresses, resulting in r = 34,000 categories for the 
X-variable. For this particular example the synthetic 
two-way table simplifies to 


$$\sum_{k=1}^{N} (B^t x_k)(A^t x_k)^t = \sum_{j=1}^{r} N_j \bar{y}_j \bar{z}_j^t,$$


where $N_j$ denotes the size of the j-th address, $\bar{y}_j$ the mean 
of the six age categories of the j-th address, and $\bar{z}_j$ the mean 
of the two car ownership categories of the j-th address. 

In order to calculate the synthetic two-way table, both 
registrations are combined as follows. Firstly, they are 
sorted according to the key number for addresses. Secondly, 
the address counts of the six age categories and the two car 
ownership categories are calculated. Thirdly, each address 
count of age, is linked with its corresponding address count 
of car ownership. By means of this synthetic registration of 
r = 34,000 addresses, the synthetic two-way table can be 
calculated. The result is shown in Table 1. This table can be 
considered as a first approximation of the real frequency 
distribution between age and car ownership. A sufficient 
condition for a close approximation, is homogeneity with 
respect to either age or car ownership within all addresses, 
i.e., all persons at the same address should either be in the 
same age category or in the same car ownership category. 
For most (multiple) person addresses, this seems to be an 
unlikely proposition. It follows from equations (4) and (5) 
that the row and column totals in Table 1 coincide with the 
real (marginal) population counts of age and car ownership 
respectively. 
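
The three steps can be mimicked on a toy example. The sketch below assumes a small artificial set of addresses with randomly assigned age and car ownership categories; the function `counts_by_address`, the sizes and the seed are hypothetical and do not reproduce the Travel Survey data. The table is laid out as in Table 1, with car ownership in the rows and age in the columns.

```python
import numpy as np

rng = np.random.default_rng(3)
n_addr = 50                                                   # hypothetical number of addresses
address = np.repeat(np.arange(n_addr), rng.integers(1, 5, n_addr))  # address key per person
N = address.size
age = rng.integers(0, 6, N)                                   # first registration: age class
car = rng.integers(0, 2, N)                                   # second registration: car ownership

def counts_by_address(address, codes, n_levels, n_addr):
    """Count, per address, the number of persons in each category."""
    table = np.zeros((n_addr, n_levels))
    np.add.at(table, (address, codes), 1.0)
    return table

# Steps 1 and 2: group each registration by address key and count categories.
age_counts = counts_by_address(address, age, 6, n_addr)
car_counts = counts_by_address(address, car, 2, n_addr)
size = age_counts.sum(axis=1)                                 # N_j, persons per address

# Step 3: link the address counts; sum_j N_j * (mean car)_j (mean age)_j'.
synthetic = (car_counts / size[:, None]).T @ age_counts       # 2 x 6 table

# Same table computed person by person from the address means.
age_mean, car_mean = age_counts / size[:, None], car_counts / size[:, None]
per_person = car_mean[address].T @ age_mean[address]

print(np.allclose(synthetic, per_person))
print(np.allclose(synthetic.sum(axis=1), np.bincount(car, minlength=2)),
      np.allclose(synthetic.sum(axis=0), np.bincount(age, minlength=6)))
```

Only the address-level counts are needed, so the two registrations never have to be matched person by person.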

By means of a simple random sample of n = 1000 persons, 
the population cell counts are estimated using a 
general regression estimator. Three sets of auxiliary 
variables are used, in accordance with the three sets of 
constraints mentioned in the previous section. The estimated 
tables are given below (for convenience we have taken the 
quadratic distance measure: $G(w_k/d_k) = (w_k/d_k - 1)^2$). The 
corresponding estimated standard deviations are in 
parentheses. These estimates are based on the usual variance 
formulas of the general regression estimator, see Sarndal 
et al. (1992, chap. 6). 


Table 1 
Synthetic Population Totals for Crossings Between Age 
and Car Ownership 


            1      2      3      4      5      6   total 
yes      3461   1659   5739  10770   6536   3334   31499 
no       9827   4692   7902  17102   6424   5389   51336 
total   13288   6351  13641  27872  12960   8723   82835 


Table 2 
Estimated Population Totals for Crossings Between Age and 
Car Ownership, Satisfying the First Set of Constraints 


        1           2          3            4             5            6            total 
yes     0 (0)       0 (0)      4968 (423)   15414 (543)   7518 (458)   3599 (375)   31499 
no      13288 (0)   6351 (0)   8673 (423)   12458 (543)   5442 (458)   5124 (375)   51336 
total   13288       6351       13641        27872         12960        8723         82835 


Table 3 
Estimated Population Totals for Crossings Between Age and 
Car Ownership, Satisfying the Second Set of Constraints 


        1           2            3             4           5           6          total 
yes     0 (…)       0 (…)        4790 (…)      13826 (…)   6887 (…)    … (…)      28923 (…) 
no      14385 (…)   7012 (555)   8118 (…)      … (…)       5853 (…)    5654 (…)   … (…) 
total   14385 (…)   7012 (555)   12908 (603)   26718 (…)   12739 (…)   9074 (…)   82835 


Table 4 
Estimated Population Totals for Crossings Between Age and 
Car Ownership, Satisfying the Third Set of Constraints 


        1           2          3          4           5          6       total 
yes     0 (…)       0 (…)      5501 (…)   15647 (…)   6898 (…)   … (…)   31499 
no      13288 (…)   6351 (…)   8139 (…)   12224 (…)   … (…)      … (…)   51336 
total   13288       6351       13641      27872       12960      8723    82835 


In Table 2 the population counts are estimated according 
to the ordinary incomplete two-way stratification 
(Bethlehem and Keller 1987). No young people 
(age categories 1 and 2) owning a car are observed in the 
sample, which is likely to be representative of the population, 
so these cells are estimated by zero. Due to the 
consistency requirements, i.e., the first set of constraints, it 
follows that the estimated cell counts of young people 
without a car equal the corresponding marginal cell counts. 
An attempt to improve Table 2 is to use the common 
variable address in the weighting procedure. In Table 3, the 
cell estimates are given according to the second set of 
constraints. As already mentioned in the previous section, 
the estimated row and column totals may differ from the 
real population counts. A comparison between Table 2 and 
Table 3 shows that these differences can be considerable. In 
addition, almost all estimated cell counts in Table 2 have 
smaller estimated standard deviations than the corresponding 
estimated cell counts in Table 3. So, the second set of 
constraints gives quite unsatisfactory results. The third set 
of constraints covers the first set of constraints. This implies 
1) consistency of the estimated marginal cell counts with 
respect to the corresponding known population cell counts, 
and 2) smaller asymptotic variances of all estimated cell 
counts. The results are shown in Table 4. Indeed, the 
estimated marginal cell counts are consistent, and the 
estimated standard deviations are at most half of the 
corresponding estimated standard deviations given in Table 2. 
2.4 Imputing Values of the One Registration into the 
Other Registration 


Until now, we have developed a weighting method to 
estimate a two-way table between two variables, which are 
registered in two distinct registrations. Often, one is in- 
terested not only in estimated two-way tables, or more 
generally, estimated linear relations, but in complete 
registrations in which both variables are simultaneously 
registered. Users of statistics find such complete data-bases 
easy to analyze. The creation of such enriched registrations 
can be seen as a special case of imputation. One registration 
serves as a host or recipient source, and the other as a donor 
source. Assuming the second registration to be the donor 
source, the problem is imputing Z-values from the second 
registration, into the first registration using the estimated 
two-way table discussed in Section 2.2, as auxiliary 
information. Statistical matching problems using data from 
a third data source, have already been considered by Rubin 
(1986) and Paass (1986). Singh et al. (1993) give a review 
of their methods. In addition, they propose some modifications 
to Rubin’s (1986) and Paass’s (1986) methods. Our 
imputation method is based on the regression method 
suggested by Rubin (1986) and Singh et al. (1993). 

After having defined predictors for the Z-variables by 
means of the regression model 


$$\hat{z}_k = A^t x_k, \qquad k = 1, \ldots, N,$$


where A is given by (2), we define new predictions for these 
variables by means of the enlarged regression model 
Using well-known results about partial regression 
coefficients in the general linear model (see e.g., Seber 
1977), $\alpha_1$ and $\alpha_2$ can be expressed as 

$$ \alpha_1 = A - B \alpha_2, $$

and 
where B and A are given by (1) and (2) respectively. They 
can be calculated from the first and second registration. The 
partial regression coefficients should be estimated from the 
third source. We suggest 


and 
=f 
where $w_k$ are the calibration weights discussed in 
Section 2.2. Based on these estimates we define new 
predictions for the Z-values: 

$$ \hat{z}_k = \hat{\alpha}_1' x_k + \hat{\alpha}_2' y_k = A' x_k + \hat{\alpha}_2' (y_k - B' x_k), \quad k = 1, \ldots, N. \qquad (7) $$


These new predictions equal the old predictions (see 
Section 2.1) plus an adjustment term. This adjustment term 
depends on the difference between the Y-value and its (old) 
prediction. It can be viewed as an attempt to improve the 
prediction for Z; more importantly, however, it is a means 
to reconstruct the weighting type estimator under the third 
set of constraints (Section 2.2). Indeed, the following 
equality holds: 


$$ \sum_{k=1}^{N} y_k \hat{z}_k' = B' \sum_{k=1}^{N} x_k x_k' A + \sum_{k=1}^{n} w_k (y_k - B' x_k)(z_k - A' x_k)'. $$

This is just the weighting type estimator under the third set 
of constraints, if the corresponding calibration weights are 
used to estimate $\alpha_2$. It is easy to show that 

$$ \sum_{k=1}^{N} x_k \hat{z}_k' = \sum_{k=1}^{N} x_k z_k'. $$
So the XZ-table can also be reconstructed. At the beginning 
of this section, we assumed the second registration to be the 
donor source. This choice was arbitrary. If the Y-values 
were imputed instead of the Z-values, we would have 
obtained an identical estimate for the YZ-table. In addition, 
the XY-table could have been reconstructed. 

The new predictions for the Z-values can be used for 
imputation. Singh et al. (1993) give algorithms for 
imputation using regression models. These Z-values can be 
imputed in the first registration in two steps. In the first 
step, the predictions given by (7) are calculated for each 
$(x_k, y_k)$ in the first registration. We have shown that the 
crossings between the Y-values and these predicted 
Z-values can be considered as weighting type estimators. 
However, the calculated predictions in general do not take 
realistic values, and therefore the first step is followed by a 
second step. In the second step, each predicted Z-value in 
the first registration is replaced by a live Z-value from the 
second registration, namely the one that is nearest under 
some Euclidean distance in (X, Z). 
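As an illustration only, the two imputation steps can be sketched in Python with NumPy. The function names (`predict_z`, `impute_nearest_donor`), the array layout (one row per record), and the toy numbers are assumptions made for this sketch; the distance used is the plain, unscaled Euclidean distance on the stacked (x, z) vectors, which is only one possible choice.

```python
import numpy as np

def predict_z(x, y, a_coef, b_coef, alpha2):
    """Step 1: regression prediction plus the adjustment term.

    Rows of x and y are records; a_coef (p x q) and b_coef (p x r) play the
    role of A and B, and alpha2 (r x q) the partial regression coefficient.
    """
    return x @ a_coef + (y - x @ b_coef) @ alpha2

def impute_nearest_donor(x_host, z_pred, x_donor, z_donor):
    """Step 2: replace each prediction by the live Z-value of the nearest donor."""
    host = np.hstack([x_host, z_pred])      # shape (n_host, p + q)
    donor = np.hstack([x_donor, z_donor])   # shape (n_donor, p + q)
    # Squared Euclidean distance between every host and every donor record.
    d2 = ((host[:, None, :] - donor[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return z_donor[nearest], nearest

# Toy data (hypothetical): two host records, three donor records.
x_host = np.array([[1.0, 0.0], [0.0, 1.0]])
y_host = np.array([[1.0], [0.0]])
a_coef = np.array([[0.7], [0.2]])
b_coef = np.array([[0.5], [0.5]])
alpha2 = np.array([[0.2]])
x_donor = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z_donor = np.array([[1.0], [0.0], [0.0]])

z_pred = predict_z(x_host, y_host, a_coef, b_coef, alpha2)
z_imputed, donor_index = impute_nearest_donor(x_host, z_pred, x_donor, z_donor)
```

In practice the coordinates would usually be scaled before distances are computed, since X and Z may be measured on very different scales.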


2.5 Estimating Cross-Products for Continuous 
Y- and Z-Variables 


The consistency property of the third set of constraints 
(Section 2.2) also holds with respect to continuous Y- and 
Z-variables, provided that there exist constants $a_y$ and $a_z$ of 
proper order, such that $a_y' y_k = 1$ and $a_z' z_k = 1$ for all k. To 
see this, we slightly extend the results of Section 2.1. First 
note that 


N N a 
(it is still assumed that there exists a constant a such that 
$a' x_k = 1$ for all k). Similarly, it holds that $a_z' A' x_k = 1$. The 
equivalent equations of (4) and (5) for the continuous case 
are readily obtained. Consequently, pre-multiplying both 
sides of (II) with $a_y'$ gives $\sum_{k=1}^{n} w_k z_k = \sum_{k=1}^{N} z_k$, and 
post-multiplying both sides of (II) with $a_z$ yields 
$\sum_{k=1}^{n} w_k y_k = \sum_{k=1}^{N} y_k$. So, the third set of constraints meets 
the consistency objective, i.e., the calibration equation of 
the first set of constraints, for quite general Y- and 
Z-variables. We will give two examples. 

In the first example we take $y_k = (1, y_{2k})'$ and 
$z_k = (1, z_{2k})'$, where both $y_{2k}$ and $z_{2k}$ are assumed to be 
continuous. By taking $a_y = a_z = (1, 0)'$ we see that 
$a_y' y_k = a_z' z_k = 1$ for all k. The cross-product between Y and Z 
equals 

$$ \sum_{k=1}^{N} y_k z_k' = \begin{pmatrix} N & \sum_{k=1}^{N} z_{2k} \\ \sum_{k=1}^{N} y_{2k} & \sum_{k=1}^{N} y_{2k} z_{2k} \end{pmatrix}, $$

from which the covariance between $y_{2k}$ and $z_{2k}$ is easily 
derived. This cross-product can be estimated using the third 
set of constraints. An elaboration of this set gives the 
following four constraints for this particular example: 


$$ \sum_{k=1}^{n} w_k = N, \quad \sum_{k=1}^{n} w_k y_{2k} = \sum_{k=1}^{N} y_{2k}, \quad \sum_{k=1}^{n} w_k z_{2k} = \sum_{k=1}^{N} z_{2k}, $$

and 

$$ \sum_{k=1}^{n} w_k \left( y_{2k} z_{2k} - (y_{2k} - B_2' x_k)(z_{2k} - A_2' x_k) \right) = \sum_{k=1}^{N} (B_2' x_k)(A_2' x_k), $$


where the regression coefficients are given by 
and 


If one is especially interested in the correlation coefficient 
between $y_{2k}$ and $z_{2k}$, then the following constraints may be 
considered in addition: 

$$ \sum_{k=1}^{n} w_k y_{2k}^2 = \sum_{k=1}^{N} y_{2k}^2 \quad \text{and} \quad \sum_{k=1}^{n} w_k z_{2k}^2 = \sum_{k=1}^{N} z_{2k}^2. $$


In the second example, we suppose that $y_k = (1, y_{2k})'$, 
where $y_{2k}$ may be continuous, and $z_k$ is categorical with g 
categories. By taking $a_y = (1, 0)'$ and $a_z = \iota$, where $\iota$ is a 
vector of ones of proper order, we see that $a_y' y_k = a_z' z_k = 1$ 
for all k. The cross-product between Y and Z is 


where $C_h$ denotes the set of population elements belonging 
to the h-th category of Z, and $N_h$ the size of $C_h$. It is 
ensured that the calibration weights according to the third 
set of constraints satisfy the 'marginal' calibration 
equations $\sum_{k=1}^{n} w_k z_k = \sum_{k=1}^{N} z_k = (N_1, \ldots, N_g)'$ and 
$\sum_{k=1}^{n} w_k y_{2k} = \sum_{k=1}^{N} y_{2k}$, which both may be of interest in view of 
consistency requirements. 


3. COMBINING INDEPENDENT SAMPLES 
ACROSS COMMON VARIABLES 


In the previous section, we have presented a method for 
combining two registrations across common variables, 
using auxiliary information from a small sample. In this 
section, the method is adapted to combine two independent 
samples. We consider a complete registration of 
persons, two large-scale sample surveys, and a small-scale 
sample survey. The registration contains a limited set of 
variables such as sex, age, region, and marital status. These 
variables are denoted by X. In one large sample, the 
variables Y, U, and X are observed, and in the other large 
sample, the variables Z, U, and X. In the small sample all 
variables, i.e., Y, Z, U, and X, are observed. The small 
sample could come from a specially conducted small-scale 
survey, or from sample overlap of the large-scale surveys. 
In Figure 1, the data sources are schematically given. For 
convenience, it is assumed that all samples correspond to 
different units, i.e., it is assumed that there is no sample 
overlap. 


[Diagram: registration (X); first large sample (X, U, Y); small sample; second large sample (X, U, Z)] 
Figure 1. Overview of the Several Data Sources 


The common variables X and U are partitioned into 
C = (X U), where X denotes the set of common variables 
with known population totals, and U denotes the set of 
common variables with unknown population totals. All 
samples may be drawn by some complex sampling design. 
Both Y and Z are assumed to be categorical; however, as in 
Section 2.5, the suggested weighting methods are also 
applicable for continuous Y and Z. The purpose is to 
estimate the two-way table between Y and Z. We consider 
two estimators. One estimator is based on incomplete 
two-way stratification (analogous to the first set of 
constraints of Section 2.2), and the other estimator is based 
on a mix between statistical matching and calibration 
(analogous to the third set of constraints of Section 2.2). 


3.1 Incomplete Two-Way Stratification 


First the population totals of Y and Z are estimated by 
means of the first and second (large) sample respectively. 
These population totals are estimated in two phases. In the 
first phase, both (large) samples are weighted using X as a 
set of control variables. This implies that both (large) 
samples are weighted such that they reproduce the known 
population totals of X, which are denoted by $t_x$. Based on 
these weights, a pooled estimate for the population totals of 
U is 

$$ \hat{t}_u = \lambda \sum_{k \in n_1} w_{1k} u_k + (1 - \lambda) \sum_{k \in n_2} w_{2k} u_k, $$


where $w_{1k}$ and $w_{2k}$ denote the (first phase) calibration 
weights of the first and second sample, and $\lambda \in [0, 1]$. In 
the second phase, both samples are reweighted using 
simultaneously X and U as control variables. Let $v_{1k}$ and 
$v_{2k}$ denote these second phase calibration weights. The 
resulting estimators for the population totals of Y and Z can 
be considered as calibration estimators in two phases (see 
Renssen and Nieuwenbroek 1997, Section 6). These 
estimators are denoted by $\hat{t}_y$ and $\hat{t}_z$ respectively: 

$$ \hat{t}_y = \sum_{k \in n_1} v_{1k} y_k \quad \text{and} \quad \hat{t}_z = \sum_{k \in n_2} v_{2k} z_k. $$


We note that both estimators are based on a similar set of 
control variables. If the common set of variables is large, 
one may consider using a smaller subset to weight both 
samples. In general, the subset to weight the first sample 
may differ from the subset to weight the second sample. 
However, we shall assume in the sequel that both (large) 
samples are weighted according to the same set of control 
variables. 

The two-way table between Y and Z can be estimated by 
weighting the (small) third sample, using simultaneously Y 
and Z as control variables, i.e., 

$$ \hat{T} = \sum_{k \in n_3} w_{3k} (y_k z_k'), $$

where the calibration weights $w_{3k}$ satisfy the constraints 

$$ \sum_{k \in n_3} w_{3k} y_k = \hat{t}_y \quad \text{and} \quad \sum_{k \in n_3} w_{3k} z_k = \hat{t}_z. $$

This is incomplete two-way stratification, where the 
unknown population totals of Y and Z are replaced by their 
estimates. These sets of constraints ensure precisely 
estimated marginal counts of the YZ-table if the common 
variables C are highly correlated with Y and Z. 
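A minimal sketch of this weighting step, assuming the quadratic distance measure (so that the calibrated weights are linear in the control variables) and equal starting weights. The function name `linear_calibration`, the toy sample, and the marginal totals are hypothetical. Because the Y-indicators and the Z-indicators both sum to one, the system is rank deficient, so the sketch uses a least-squares solve; this requires the two sets of estimated totals to add up to the same population size.

```python
import numpy as np

def linear_calibration(d, aux, totals):
    """Calibrated weights w_k = d_k (1 + aux_k' lam) reproducing `totals`."""
    d = np.asarray(d, dtype=float)
    aux = np.asarray(aux, dtype=float)
    totals = np.asarray(totals, dtype=float)
    m = (aux * d[:, None]).T @ aux                  # sum_k d_k aux_k aux_k'
    # lstsq copes with the singular system arising from two sets of indicators.
    lam = np.linalg.lstsq(m, totals - aux.T @ d, rcond=None)[0]
    return d * (1.0 + aux @ lam)

# Toy small sample (hypothetical): six units, Y and Z with two categories each.
y = np.eye(2)[[0, 0, 0, 1, 1, 1]]                   # indicator rows for Y
z = np.eye(2)[[0, 1, 0, 1, 0, 1]]                   # indicator rows for Z
d = np.full(6, 10.0)                                # starting weights
t_y = np.array([35.0, 25.0])                        # estimated totals of Y
t_z = np.array([28.0, 32.0])                        # estimated totals of Z

w3 = linear_calibration(d, np.hstack([y, z]), np.concatenate([t_y, t_z]))
table = (y * w3[:, None]).T @ z                     # estimated YZ-table
```

The row and column sums of `table` then agree with the marginal totals supplied, which is the consistency property of incomplete two-way stratification.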


3.2 Synthetic Two-Way Stratification 


In this section, we consider an alternative estimator for 
the YZ-table, which also uses the (large) samples as a source 
of auxiliary information. However, instead of using 
estimated marginal counts as auxiliary information, 
estimated synthetic cell counts are used. Let B denote the 
population regression coefficient between Y and C, which 
is estimated by the first (large) sample: 

$$ \hat{B} = \left( \sum_{k \in n_1} v_{1k} c_k c_k' \right)^{-1} \left( \sum_{k \in n_1} v_{1k} c_k y_k' \right). $$

Similarly, let A denote the population regression 
coefficient between Z and C, which is estimated by the 
second (large) sample: 

$$ \hat{A} = \left( \sum_{k \in n_2} v_{2k} c_k c_k' \right)^{-1} \left( \sum_{k \in n_2} v_{2k} c_k z_k' \right). $$
Note that these estimated regression coefficients are based 
on the second phase calibration weights instead of the 
inclusion weights. If there exists a constant a, such that 
$a' c_k = 1$ for all k, then we still have $\iota' \hat{B}' c_k = \iota' \hat{A}' c_k = 1$ 
for all k. Now, inspired by the decomposition given by (3), 
i.e.: 

$$ \sum_{k=1}^{N} y_k z_k' = B' \sum_{k=1}^{N} (c_k c_k') A + \sum_{k=1}^{N} (y_k - B' c_k)(z_k - A' c_k)', $$

we suggest estimating the two-way table in two steps. In the 
first step the first term on the right-hand side is estimated by 
substituting the population regression coefficients B and A 
by their estimates $\hat{B}$ and $\hat{A}$. Furthermore, we suggest 
estimating $\sum_{k=1}^{N} c_k c_k'$ by the pooled estimate: 

$$ \hat{S} = \gamma \sum_{k \in n_1} v_{1k} (c_k c_k') + (1 - \gamma) \sum_{k \in n_2} v_{2k} (c_k c_k'), $$

where $v_{1k}$ and $v_{2k}$ denote the (second phase) weights of the 
first and second sample, and $\gamma \in [0, 1]$. Eventually, the first 
term is estimated by $\hat{B}' \hat{S} \hat{A}$. Until now, no use of the 
third (small) sample has been made. If desired, estimates for 
B, A, and $\sum_{k=1}^{N} c_k c_k'$ can be improved slightly by also using the 
small sample. 
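The first step can be illustrated with a short Python sketch. The helper names (`weighted_coef`, `synthetic_table`), the row-per-record array layout, and the toy data are assumptions of the sketch. With a categorical C, as in the toy data, the estimated coefficients are simply the weighted within-category proportions, and the synthetic table is the table implied by conditional independence of Y and Z given C.

```python
import numpy as np

def weighted_coef(c, t, v):
    """Weighted regression coefficient (sum v c c')^{-1} (sum v c t')."""
    cw = c * v[:, None]
    return np.linalg.solve(cw.T @ c, cw.T @ t)

def synthetic_table(c1, y1, v1, c2, z2, v2, gamma=0.5):
    """Estimate of the synthetic two-way table from the two large samples."""
    b_hat = weighted_coef(c1, y1, v1)               # from the first sample
    a_hat = weighted_coef(c2, z2, v2)               # from the second sample
    s_hat = (gamma * (c1 * v1[:, None]).T @ c1
             + (1.0 - gamma) * (c2 * v2[:, None]).T @ c2)
    return b_hat.T @ s_hat @ a_hat

# Toy data (hypothetical): C, Y and Z each have two categories.
c1 = np.eye(2)[[0, 0, 1, 1]]
y1 = np.eye(2)[[0, 1, 0, 0]]
v1 = np.full(4, 5.0)
c2 = np.eye(2)[[0, 0, 1, 1]]
z2 = np.eye(2)[[0, 0, 0, 1]]
v2 = np.full(4, 5.0)

table = synthetic_table(c1, y1, v1, c2, z2, v2)
```

In this toy case each category of C has weighted size 10, so the result is the sum over categories of 10 times the outer product of the within-category proportions of Y and Z.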

In the second step, the complete two-way table between 
Y and Z is estimated by weighting the third (small) sample 
according to the calibration estimator subject to the third set 
of constraints (see Section 2.5), where B, A, and $\sum_{k=1}^{N} c_k c_k'$ are 
replaced by their estimates $\hat{B}$, $\hat{A}$, and $\hat{S}$. The resulting 
estimator equals 

$$ \sum_{k \in n_3} w_{3k} (y_k z_k') = \hat{B}' \hat{S} \hat{A} + \sum_{k \in n_3} w_{3k} (y_k - \hat{B}' c_k)(z_k - \hat{A}' c_k)'. \qquad (8) $$

The first term on the right-hand side is an estimate for 
the synthetic two-way table. This estimate is approximately 
unbiased for the YZ-table if the conditional independence 
assumption holds. We note that this type of estimator is 
essentially obtained by applying the constrained statistical 
matching method (see e.g., Barr and Turner 1980, 
Rodgers 1984, or Rubin 1986). The second term is an 
adjustment term to obtain an approximately unbiased 
estimate for the YZ-table without this assumption. If there 
exists a constant a such that $a' c_k = 1$ for all sampled 
elements, then we obtain, by pre-multiplying both sides of 
(8) with $\iota'$, the following estimator for the population total 
of Z: 
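$$ \sum_{k \in n_3} w_{3k} z_k' = \left[ \gamma \sum_{k \in n_1} v_{1k} c_k' + (1 - \gamma) \sum_{k \in n_2} v_{2k} c_k' \right] \hat{A} = \left[ \sum_{k \in n_2} v_{2k} c_k' \right] \hat{A} = a' \left[ \sum_{k \in n_2} v_{2k} c_k c_k' \right] \hat{A} = \hat{t}_z'. $$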


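Similarly, we have by post-multiplying both sides with $\iota$, an 
estimator for the population total of Y: 

$$ \sum_{k \in n_3} w_{3k} y_k = \hat{B}' \left[ \gamma \sum_{k \in n_1} v_{1k} c_k + (1 - \gamma) \sum_{k \in n_2} v_{2k} c_k \right] = \hat{B}' \left( \sum_{k \in n_1} v_{1k} c_k \right) = \hat{B}' \left[ \sum_{k \in n_1} v_{1k} c_k c_k' \right] a = \hat{t}_y. $$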


It follows that the marginal cell counts of the estimated 
two-way table are the two-phase calibration estimators for 
the population totals of Y and Z as defined in Section 3.1. 


3.3 A Simulation Study; Integration of Household 
Surveys 


In this subsection, we compare the weighting techniques 
of incomplete two-way stratification, as discussed 
in subsection 3.1, and synthetic two-way stratification, as 
discussed in subsection 3.2, by means of a simulation study. 
For that purpose, we use a data set that stems from a pilot 
study of the Dutch Household Survey on Living Conditions 
(see van Tuinen 1995). The data set consists of 1,085 
records for which the following variables are observed: age 
(six categories: 15-24, 25-34, 35-44, 45-54, 55-64, 65+), 
sex (two categories: male or female), ownership of house 
(two categories: yes or no), occupation (five categories: 
work, housekeeping, education, voluntary, other), and 
health (two categories: yes or no). For the purposes of the 
simulation study, this data set is considered as a finite 
population. The population totals of age and sex are 
assumed to be known. 

In order to simulate the weighting techniques, we have 
carried out a Monte Carlo algorithm. Namely, we have 
drawn 500 samples, independently of each other, according 
to a two-phase sampling design. In the first phase, a simple 
random sample of size 20,500 is drawn with replacement. 
In this sample, age, sex, and ownership of house, are 
observed. In the second phase, the (first phase) sample is 
randomly divided into two large sub-samples of sizes 
10,000 and one small sub-sample of size 500; in the one 
large sub-sample, occupation is observed (denoted by Y), in 
the other large sub-sample, health (denoted by Z), and in the 
small sub-sample, both occupation and health are observed. 
At each run, we have estimated the two-way table between 
Y and Z, according to four weighting methods which are 
discussed next. 
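The sampling design of one simulation run can be sketched as follows. This is only an illustration of the design described above: the function name `draw_two_phase_sample` and the use of NumPy's random generator are choices made for the sketch, and population units are represented by their row indices.

```python
import numpy as np

def draw_two_phase_sample(pop_size, rng, n_large=10_000, n_small=500):
    """One run: SRS with replacement, then a random split into three parts."""
    n_first = 2 * n_large + n_small                 # 20,500 in the study
    # First phase: simple random sampling with replacement from the population.
    first_phase = rng.integers(0, pop_size, size=n_first)
    # Second phase: random division into two large and one small sub-sample.
    shuffled = rng.permutation(first_phase)
    s1 = shuffled[:n_large]                         # occupation (Y) observed
    s2 = shuffled[n_large:2 * n_large]              # health (Z) observed
    s3 = shuffled[2 * n_large:]                     # both Y and Z observed
    return s1, s2, s3

rng = np.random.default_rng(12345)
s1, s2, s3 = draw_two_phase_sample(1085, rng)
```

Repeating this draw 500 times with independent random numbers gives the replicates over which averages and variances are taken.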
The first phase sample is weighted with a crossing 
between sex and age as control variables. This is just 
post-stratification with twelve post-strata. Based on these 
weights, population totals can be estimated for all observed 
variables in the first phase sample, and for crossings 
between them. In particular, we may reproduce the 
population totals for the crossing between age and sex, and 
obtain estimated population totals for the crossings between 
age, sex, and ownership of house. Now, we distinguish two 
sets of common variables to weight the large sub-samples, 
as well as to obtain an estimate for the synthetic two-way 
table between Y and Z. The first set is a crossing between 
age and sex (12 categories) and the second set is a crossing 
between age, sex, and ownership (24 categories). For each 
simulation, this gives two different estimates for the 
marginal counts, i.e. two different estimates for the 
population totals of Y and Z — note that both estimates are 
based on post-stratification — and two different estimates for 
the synthetic two-way table. In order to weight the small 
sub-sample, we distinguish between the weighting method 
based on incomplete two-way stratification, and the 
weighting method based on synthetic two-way 
stratification. Since two different sets of common variables are 
used to weight the large sub-samples, as well as for 
statistical matching, we obtain four sets of calibration 
weights for each simulation run with respect to the small 
sub-sample, which in turn gives for each simulation run, 
four different estimated two-way tables between Y and Z. 
For the ease of computation, we have used the quadratic 
distance measure in the calibration estimation, implying that 
each estimated cell corresponds to a general regression 
estimate. Finally, we have taken the averages and variances 
of these two-way tables over the 500 simulations. The 
results are shown in Tables 5 to 8. 

The averages over the 500 simulations are almost 
identical for the four types of estimators, as can be seen 
from these tables. Note that the given cell counts are 
rounded off. We have also calculated the real YZ-table from 
the finite population. The real counts equal exactly the 
averages, which are given in Table 5 (or 6). For this 
particular simulation study, we conclude that all estimators 
have a very small bias. 

The variances over these 500 simulations are given 
within the brackets. The variances of the estimated marginal 
counts of Tables 5 and 7 coincide, because these estimates 
are based on the same estimator. For the same reason, the 
variances of the estimated marginal counts in 
Tables 6 and 8 coincide. Note that the variances of the 
estimated marginal counts in Tables 6 and 8 are slightly 
smaller than the variances of the estimated marginal counts 
in Tables 5 and 7, due to the larger set of common 
variables. However, for most estimated marginal counts this 
variance reduction can be considered negligible. 

Tables 5 and 6 give identical variances with respect to all 
estimated cell counts. The variances for most estimated cell 
counts in Table 7 are plainly smaller than those in Tables 5 
and 6. In Table 8, this variance reduction is even greater. 
For this particular example, we conclude that the use of the 
larger set of common variables, in combination with the 
first weighting method, slightly reduces the variances of the 
estimated marginal counts, but leaves the variances of the 
estimated cell counts unaffected. Naturally, using the larger 
set of common variables in combination with the second 
weighting method also slightly reduces the variances of the 
marginal cell counts. Finally, given a set of common 
variables, the weighting method based on synthetic matching 
results in smaller variances for the estimated cell counts 
than the weighting method based on incomplete two-way 
stratification. 


Table 5 
Incomplete Two-way Stratification Combined with the First Set of 
Common Variables 


1 2 3 4 5 total 
yes emty 39 898) Le 59.49) 85205, 
no Gian 104,45 tiles Als 46.46) 2330) 
total 508.3, °336u9, 100) 364) 105.) 1085 
Table 6 


Incomplete Two-way Stratification Combined with the Second Set 
of Common Variables 


1 2 3 4 5 total 
yes BAT 28 on 898) Sas 59.5) 352 
no 6 lq) 10 4.99) Moy 11 yo) 4646) 23347 
total 50843, 33645, «(100% 36. 105) 1085 
Table 7 


Synthetic Two-way Stratification Combined with the First Set of 
Common Variables 


1 2 3 4 5 total 

yes Pay sew 89... 256,, 59.49) 851m, 

no Gi OS ses 11, ties, 46 a8 234.47 

fotali S08 ey 336 uneme 100.5 360) 105q0, -:1085 
Table 8 


Synthetic Two-way Stratification Combined with the Second Set of 
Common Variables 


1 2 3 4 5 total 
yes dejo Oat) 89 Zaye One aia, 
no Gl TOS NEO te ne 467 234.7 
fOtalpha SOS 54, 06336 paaleslOO js: 36.) 105», 1085 


3.4 Imputing Values of One Large Sample into 
the Other Large Sample 


By means of the two large samples and the small sample, 
one may construct a synthetic sample in which the real 
Y-values and predicted Z-values, and/or the predicted 
Y-values and the real Z-values are simultaneously recorded. 
We define predictions for the Y- and Z-values analogously 
to (7), namely 

$$ \hat{y}_k = \hat{B}' c_k + \hat{\beta}_2' (z_k - \hat{A}' c_k), \quad k = 1, \ldots, n_2, \qquad (9) $$

and 

$$ \hat{z}_k = \hat{A}' c_k + \hat{\alpha}_2' (y_k - \hat{B}' c_k), \quad k = 1, \ldots, n_1, \qquad (10) $$


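where 

$$ \hat{\beta}_2 = \left[ \sum_{k=1}^{n_2} v_{2k} (z_k - \hat{A}' c_k)(z_k - \hat{A}' c_k)' \right]^{-1} \times \sum_{k=1}^{n_3} w_{3k} (z_k - \hat{A}' c_k)(y_k - \hat{B}' c_k)', $$

and 

$$ \hat{\alpha}_2 = \left[ \sum_{k=1}^{n_1} v_{1k} (y_k - \hat{B}' c_k)(y_k - \hat{B}' c_k)' \right]^{-1} \times \sum_{k=1}^{n_3} w_{3k} (y_k - \hat{B}' c_k)(z_k - \hat{A}' c_k)'. $$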


For each $(c_k, y_k)$ the Z-values can be imputed in the first 
large sample by means of (10), $k = 1, \ldots, n_1$, and similarly 
for each $(c_k, z_k)$ the Y-values can be imputed in the second 
large sample by means of (9), $k = 1, \ldots, n_2$. Based on these 
imputed values, we may define the following estimates for 
the two-way table between Y and Z: 


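$$ \sum_{k=1}^{n_1} v_{1k} y_k \hat{z}_k' = \hat{B}' \sum_{k=1}^{n_1} v_{1k} c_k c_k' \hat{A} + \sum_{k=1}^{n_3} w_{3k} (y_k - \hat{B}' c_k)(z_k - \hat{A}' c_k)', \qquad (11) $$

and 

$$ \sum_{k=1}^{n_2} v_{2k} \hat{y}_k z_k' = \hat{B}' \sum_{k=1}^{n_2} v_{2k} c_k c_k' \hat{A} + \sum_{k=1}^{n_3} w_{3k} (y_k - \hat{B}' c_k)(z_k - \hat{A}' c_k)'. \qquad (12) $$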


One estimate is based on the first synthetic sample, the 
other on the second synthetic sample. By pooling the 
synthetic samples, one obtains a pooled synthetic sample of 
size $n_1 + n_2$, from which a pooled estimate for the 
two-way table can be constructed. This pooled estimate 
shows a close resemblance to (8). Note that if C and Z are 
perfectly correlated, then the left-hand side of (11) reduces 
to $\sum_{k=1}^{n_1} v_{1k} y_k z_k'$, i.e., our estimated two-way table 
corresponds to a weighted estimated two-way table based on the 
first sample, as if the real values of Z were imputed in this 
sample. Similarly, if C and Y are perfectly correlated, then 
(12) reduces to $\sum_{k=1}^{n_2} v_{2k} y_k z_k'$. 

An important special case to consider is when c is 
categorical. Then the following equalities hold true: 


t 
= diag} “|, 
t, 


so (11) and (12) coincide. Furthermore, we have for 
categorical c: 


pes VielCg Ce) z Pa, Vpcere,) 


ken, ken, 


ds Vin ZK = =) Vipepzp 


ken, ken, 


and 


> VoD ilk = oy Wace 


ken, ken, 


Obviously, if c is categorical, then it suffices to create a 
synthetic sample, which is based on either the first synthetic 
sample or the second synthetic sample. In either case, the 
weighting type estimates for the CZ-table, the CY-table, and 
the YZ-table, can be reconstructed. Finally, we note that the 
imputed values in all synthetic samples may be unrealistic. 
As described in Section 2.4, the calculated predictions may 
be replaced by live values according to some algorithm. 


4. SUMMARY 


In this article we presented a weighting procedure to 
combine information from distinct sample surveys. The 
linking pin between these surveys is a set of common 
variables (see Figure 1). It is argued that these samples 
should be weighted according to a sequential structure. 
First, both large samples were weighted using X as control 
variables. Based on these weighted samples, we could 
obtain a pooled estimate for the population total of U. Then 
both large samples were reweighted using simultaneously 
X and U as control variables. This gave an estimate for the 
population total of Y and Z. 

Using statistical matching techniques with X and U as 
common variables, we may also obtain an estimate for a 
synthetic two-way table between Y and Z. Eventually, the 
small sample was weighted according to two different sets 
of control variables. The first set of control variables 
corresponded to the estimated population totals of Y and Z, 
and the second set of control variables to the estimated 
synthetic two-way table. Using the first set of control 
variables is strongly related to incomplete two-way 
stratification. The theoretical framework needed to develop 
the second weighting method was discussed throughout 
this article. By means of both weighting methods, the 
YZ-table can be estimated (it is tacitly assumed that Y and 
Z are categorical). The marginal counts of the YZ-table 
corresponding to the first weighting method equal, by 
definition of the calibration equations, the estimated 
population totals of Y (which is based on the first large 
sample) and Z (which is based on the second large sample). 
It was shown that this consistency property also holds for 
the second weighting method. A numerical study was 
conducted to evaluate the performance of the weighting 
methods with respect to the cell counts. It was found that 
both weighting methods yielded nearly (design) unbiased 
estimated two-way tables. The simulated (design) variances 
of the second weighting method appeared to be smaller 
than the corresponding (design) variances of the first 
weighting method, with respect to all estimated cell counts. 
In principle, the Y- and Z-variables were assumed to be 
categorical; however, it was argued that the ideas presented 
were also applicable for continuous Y and Z or for 
continuous Y and categorical Z. 
Sampling on Two Occasions: Estimation of Population Total 


RAGHUNATH ARNAB' 


ABSTRACT 


Two sampling strategies have been proposed for estimating the finite population total for the most recent occasion, based 
on the samples selected over two occasions involving varying probability sampling schemes. Attempts have been made to 
utilize the data collected on a study variable, in the first occasion, as a measure of size and a stratification variable for 
selection of the matched-sample on the second occasion. Relative efficiencies of the proposed strategies have been 
compared with suitable alternatives. 


KEY WORDS: Composite estimator; Matched-sample; Sampling schemes; Sampling strategies; Varying probability 
sampling schemes. 


1. INTRODUCTION 


We very often survey the same population at regular time 
intervals to estimate the same population characteristics 
which change over time. For example, many countries 
collect data to estimate total number of unemployed 
persons, HIV infected people, immigrants etc., on an annual 
or quarterly basis. In this article, we consider a finite 
population $U = (U_1, \ldots, U_i, \ldots, U_N)$ of N identifiable units, 
which is supposed to be sampled over two occasions, to 
estimate the population total of a variable under study for 
the current (second) occasion. In successive sampling, one 
utilizes data collected on the previous (first) occasion 
effectively, to get an efficient strategy in consideration of 
cost, and providing an efficient estimator of the population 
total for the current occasion. Extensive literature is now 
available for this purpose. Singh (1967), and Avadhani and 
Sukhatme (1970) utilized information, collected on the first 
occasion as a measure of size, for the selection of the 
matched sample on the second occasion; while Arnab 
(1991) utilized such information as a stratification variable, 
as well as the measure of size, for selection of the sample 
on the second occasion. Recently, Prasad and Graham 
(1994) modified Raj’s (1965) and Chotai’s (1974) 
sampling strategies, by using information of the first 
occasion as a measure of size, for the selection of the 
matched sample in the second occasion. They found 
empirically, that one of their proposed strategies fares better 
than that given by Chotai (1974). In this article, two 
alternative strategies are proposed. One of them utilizes 
information in the first occasion as a measure of size, and 
the other utilizes information as a measure of size and also 
as a stratification variable for selection of the matched 
sample in the second occasion. In this paper, it is shown 
that one of the proposed strategies is better than that given 
by Prasad and Graham (1994) and for the other, we do not 
have any definite theoretical conclusion. However, 
empirical evidence shows that the latter is more efficient 
than that described by Prasad and Graham (1994), as well 
as the former proposed strategy. This is possible because it 
utilizes first occasion values in all possible stages viz., 
stratification, estimation and selection of the matched 
sample in the second occasion. 

The general methods of selection of samples and 
estimation over two occasions are described below. 


1.1 Sampling Schemes 


On the first occasion, a sample $s_1$ of size n is selected 
by some suitable sampling design, say $P_1$, and the data 
$y_{1i}$, $i \in s_1$, is obtained, where $y_{1i}$ ($y_{2i}$) is the value of the 
variate y under study for the i-th unit on the first (second) 
occasion. On the second occasion, a matched sample 
(sub-sample) $s_m$ of size m (= $n\lambda$, assumed to be an 
integer, $0 < \lambda < 1$) is selected from $s_1$ by some suitable 
sampling scheme $P_m$, and it is supplemented by an 
un-matched sample $s_u$ of size u (= $n\mu = n - m$, $\mu = 1 - \lambda$) 
either from the entire population U or from $U/s_1$, the set of 
units not selected in the first occasion, by some suitable 
sampling design $P_u$, and information $y_{2i}$ ($i \in s_m$, $i \in s_u$) on 
the second occasion is obtained. It is obvious that the cost 
of survey for the matched sampled units is expected to be 
much lower than that of the un-matched units, but for the 
sake of simplicity, we assume that the cost of the survey 
remains the same for all the units in the second occasion. 


1.2 Method of Estimation 


From the data $y_{1i}$, $i \in s_1$, and $y_{2i}$, $i \in s_m$, collected through 
the initial sample $s_1$ and the matched sample $s_m$, an 
unbiased estimator $\hat{Y}_{2m}$ for $Y_2$, the population total for the 
second occasion, is formed by treating the $y_{1i}$'s, $i \in s_1$, as 
auxiliary information. Thus $\hat{Y}_{2m}$ is normally a difference, 
ratio or regression estimator. From the un-matched sample 
$s_u$, an unbiased estimator $\hat{Y}_{2u}$ is also constructed for $Y_2$. 
Finally, a composite estimator, a combination of $\hat{Y}_{2m}$ and 
$\hat{Y}_{2u}$, is obtained by using a suitable weight $\varphi$ ($0 < \varphi < 1$), 
as 

$$ \hat{Y}_2 = \varphi \hat{Y}_{2m} + (1 - \varphi) \hat{Y}_{2u}. \qquad (1) $$


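The optimum value of $\varphi = \varphi(\lambda)$ is obtained by 
minimizing $V(\hat{Y}_2)$, the variance of $\hat{Y}_2$ with respect to $\varphi$, 
for a given value of m (i.e., $\lambda$). The expressions for $\varphi(\lambda)$ 
and $V(\hat{Y}_2 | \lambda)$, the variance of $\hat{Y}_2$ with $\varphi = \varphi(\lambda)$, are 
obtained as follows, when $\hat{Y}_{2m}$ and $\hat{Y}_{2u}$ are independent: 

$$ \varphi(\lambda) = (1/V_m) / (1/V_m + 1/V_u), $$
$$ V(\hat{Y}_2 | \lambda) = 1 / (1/V_m + 1/V_u), $$

where $V_m$ and $V_u$ are the variances of $\hat{Y}_{2m}$ and $\hat{Y}_{2u}$ 
respectively. The optimum proportion of matched sample 
$\lambda = \lambda_0$ is obtained by minimizing $V(\hat{Y}_2 | \lambda)$ with respect 
to $\lambda$. Finally, putting $\lambda = \lambda_0$ in the expression for 
$V(\hat{Y}_2 | \lambda)$, the minimum variance of $\hat{Y}_2$ is obtained, and it 
will be denoted by $V_{\min}(\hat{Y}_2) = V(\hat{Y}_2 | \lambda_0)$. Our object is to 
find a suitable strategy, which is a combination of 
$P = (P_1, P_m, P_u)$ and $\hat{Y}_2$, to control the magnitude of 
$V_{\min}(\hat{Y}_2)$ to a minimum. 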
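For two independent unbiased component estimators, the variance of the composite estimator is the squared weight times the variance of the matched-sample estimator plus the squared complementary weight times the variance of the un-matched-sample estimator. Minimizing this over the weight gives the inverse-variance weighting illustrated below. The function name `optimal_composite`, the argument names, and the numbers are hypothetical and serve only as a worked example.

```python
def optimal_composite(est_m, est_u, v_m, v_u):
    """Minimum-variance combination of two independent unbiased estimators.

    est_m, v_m: matched-sample estimate and its variance.
    est_u, v_u: un-matched-sample estimate and its variance.
    """
    phi = v_u / (v_m + v_u)            # equals (1/v_m) / (1/v_m + 1/v_u)
    estimate = phi * est_m + (1.0 - phi) * est_u
    variance = v_m * v_u / (v_m + v_u)  # equals 1 / (1/v_m + 1/v_u)
    return phi, estimate, variance

# Worked example with made-up numbers.
phi, estimate, variance = optimal_composite(100.0, 120.0, 4.0, 12.0)
```

Here the matched-sample estimator is three times as precise, so it receives weight 0.75, and the composite variance (3.0) is smaller than either component variance.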


1.3 A Few Sampling Strategies 
1.3.1 Avadhani and Sukhatme (1970) 


On the first occasion, the initial sample $s_1$ of size n was 
selected by simple random sampling without replacement 
(SRSWOR) method, assuming that no auxiliary information 
is available prior to this survey. On the second occasion, the 
matched sample $s_m$ of size m was selected from $s_1$ by the 
Rao, Hartley and Cochran (RHC, in brief, 1962) sampling 
scheme using $y_{1i}$ as a measure of size for the i-th unit $i \in s_1$, 
assuming $y_{1i}$'s are positive. Under the RHC sampling 
scheme, the selected n units of $s_1$ are divided at random 
into m groups, each of size n/m, which is assumed to be an 
integer. From each of these groups, one unit is 
selected independently with probability proportional to the 
measure of size. Thus if the i-th unit, $U_i$, belongs to the 
j-th group $G_j$ ($j = 1, \ldots, m$) then $U_i$ will be selected with 
the probability $q_i^{*} (i \in s_1) = y_{1i} / \sum_{i \in G_j} y_{1i}$. The un-matched 
sample $s_u$ was selected from $U/s_1$ by SRSWOR. 
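The RHC selection step described above can be sketched in a few lines of Python. The function name `rhc_sample` and the toy size measures are assumptions of the sketch; it covers only the case of equal group sizes and returns the group totals of the size measure alongside the selected units, since these totals are typically needed at the estimation stage.

```python
import numpy as np

def rhc_sample(sizes, m, rng):
    """Select m units by the Rao-Hartley-Cochran scheme with equal groups."""
    sizes = np.asarray(sizes, dtype=float)
    n = sizes.size
    if n % m != 0:
        raise ValueError("n/m must be an integer")
    if np.any(sizes <= 0):
        raise ValueError("size measures must be positive")
    # Divide the n units at random into m groups of size n/m.
    groups = rng.permutation(n).reshape(m, n // m)
    selected = np.empty(m, dtype=int)
    group_totals = np.empty(m)
    for j, group in enumerate(groups):
        total = sizes[group].sum()
        # One unit per group, with probability proportional to size in the group.
        selected[j] = rng.choice(group, p=sizes[group] / total)
        group_totals[j] = total
    return selected, group_totals

rng = np.random.default_rng(0)
sizes = np.arange(1.0, 13.0)           # toy first-occasion values for n = 12
selected, group_totals = rhc_sample(sizes, 4, rng)
```

Because the groups partition the units, every selected unit is distinct and the group totals add up to the overall total of the size measure.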


1.3.2 Chotai (1974) 


On the first occasion, the initial sample $s_1$ of size n was 
selected by the RHC scheme of sampling (assuming N/n is 
an integer), as described above, with probability 
proportional to $z_i$, the size measure for the i-th unit, which is 
assumed to be positive and known for every $i \in U$. Let 
$A_i = \sum_{i \in G_j} p_i$, the sum of the $p_i$ ($= z_i / Z$, $Z = \sum_{i \in U} z_i$) values 
that belong to the random group $G_j$ ($j = 1, \ldots, n$), which is 
formed in selecting the sample $s_1$ by the RHC method. The 
matched sample $s_m$ was selected from $s_1$ by the RHC 
scheme, with normed size measure $A_i$ for the i-th unit 
$i \in s_1$ ($\sum_{i \in s_1} A_i = 1$), assuming n/m is an integer. The 
un-matched sample, $s_u$, was selected by the RHC sampling 
scheme with normed size measure $p_i$ for the i-th unit, 
assuming N/u is an integer. Let $P_i$ ($P_i'$) = total of the 
$A_i$ ($p_i$) values associated with those units that belong to 
the random group from which the i-th unit was selected in 
$s_m$ ($s_u$) by the RHC sampling scheme, with $\sum_{i \in s_m} P_i = 1$ 
($\sum_{i \in s_u} P_i' = 1$). 
The composite estimator for Y, is given by 


Vie=iQd pgcb(hin@)ly} 


where 


bei yy Gap yR, ef 


eS, 


v8 8 OT Nae (Y,,/P,) A; |; 


eS), 1€S) 


B=) Da Co DON. (2) 


eS) 


where $\gamma$ is a suitably chosen constant to minimize the variance
of $\hat Y_{2m}$. Chotai (1974) derived the expression for the
minimum variance of $\hat Y_2$ as

$$V_{\min}(\hat Y_2) = k\,[1 - f + \sqrt{1 - \delta^{*2}}\,]\,\sigma_2^2/2 = V_C \ \text{(say)} \qquad (3)$$


where 


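$$k = N/\{n(N-1)\}, \quad f = n/N,$$

$$\sigma_j^2 = \sum_{i \in U} p_i\,(y_{ji}/p_i - Y_j)^2, \quad j = 1, 2,$$

$$Y_j = \sum_{i \in U} y_{ji}, \quad j = 1, 2, \qquad (4)$$

$$\delta^* = \sum_{i \in U} p_i\,(y_{1i}/p_i - Y_1)(y_{2i}/p_i - Y_2)/(\sigma_1 \sigma_2).$$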


1.3.3 Arnab (1991) 


Arnab (1991) presented several strategies where the
initial sample $s_1$ was selected by probability proportional
to size with replacement (PPSWR) using normed size
measure $p_i = z_i/Z$ for the i-th unit. Utilizing the ascertained
values $y_{1i}$ ($i \in s_1$) on the basis of certain criteria, the $n$
sample units are assigned to a suitable number $L$ of strata.
Let $s_{1h}$ be the sample of size $n_h$ belonging to the h-th
stratum ($s_1 = \cup_h s_{1h}$ and $\sum_h n_h = n$). Here, it is assumed
that $n$ is large enough to ensure that $n_h$ is positive for every
$h$ in practice. On the second occasion, sub-samples $s_{2mh}$
of sizes $m_h$ ($= \nu_h n_h$, where $\nu_h$ is a predetermined fraction and $m_h$
is assumed to be an integer) are selected from the $s_{1h}$'s
independently, by suitable sampling schemes involving the
$y_{1i}$'s, $i \in s_{1h}$, in the selection of the matched samples $s_{2mh}$. The
unmatched sample $s_{2u}$ is selected by the PPSWR method from
the entire population $U$ using $z_i$ as the measure of size.


1.3.4 Prasad and Graham (1994) 


Here the initial sample $s_1$ is selected by the RHC scheme
of sampling similar to Chotai (1974) with normed size
measure $p_i = z_i/Z$ for the i-th unit. The matched sample $s_{2m}$
is selected from $s_1$ by the RHC scheme with normed size measure
$p_i^* = (y_{1i} A_i/p_i) / \sum_{i \in s_1} (y_{1i} A_i/p_i)$ for the i-th unit, $i \in s_1$,
where $A_i$ is the sum of the $p_i$ values for the group
containing the i-th unit, formed in selecting $s_1$ by the RHC
sampling scheme. The unmatched sample, $s_{2u}$,
was selected from the entire population $U$ by the RHC
scheme similar to that presented by Chotai (1974). Here
also $N/n$, $n/m$ and $N/u$ are assumed to be integers. Prasad
and Graham (1994) proposed the following composite
estimator for $Y_2$:

$$\hat Y_2 = \varphi\,\hat Y_{2m} + (1 - \varphi)\,\hat Y_{2u}$$

where $\hat Y_{2m} = \sum_{i \in s_{2m}} (y_{2i}^*/p_i^*)\,P_i$, $\hat Y_{2u} = \sum_{i \in s_{2u}} (y_{2i}/p_i)\,P_i'$,
$y_{2i}^* = y_{2i} A_i/p_i$, and $P_i$ ($P_i'$) = total of the $p_i^*$ ($p_i$) values
associated with those units that belong to the random group
from which the i-th unit was selected in $s_{2m}$ ($s_{2u}$). The
expression for the minimum variance of $\hat Y_2$ is obtained as:


$$V_{\min}(\hat Y_2) = k\,(1 - f + \sqrt{\phi})\,\sigma_2^2/2 = V_{PG} \ \text{(say)} \qquad (5)$$


where

$$\phi = \sigma_3^2/\sigma_2^2, \quad \sigma_3^2 = \sum_{i \in U} q_i\,(y_{2i}/q_i - Y_2)^2 \quad \text{and} \quad q_i = y_{1i}/Y_1; \qquad (6)$$


$k$, $f$, $\sigma_2^2$ and $Y_1$ are defined in (4).

In Prasad and Graham's (1994) expression for $V_{\min}(\hat Y_2)$,
the divisor 2 was omitted; this is obviously a typographical
error.


Remark 1.1 


From the strategies described in section 1.3, we note that
the Avadhani and Sukhatme (1970) scheme does not
require information on size measures in the whole frame,
and hence is less demanding than the others. Chotai (1974)
used the original size measures $p_i$ in selection, but the first
survey values $y_{1i}$, $i \in s_1$, were used additionally in
estimation only. The use of additional information, the $p_i$'s, for
the selection of the initial sample $s_1$ will make Chotai's
(1974) strategy more efficient than that of Avadhani and
Sukhatme (1970). But to use the optimal estimator $\hat Y_2$ for
the Avadhani and Sukhatme (1970) strategy, one needs to
estimate $\varphi$, the only unknown parameter. However, in
Chotai's (1974) strategy, both the parameters $\varphi$ and $\gamma$ have
to be estimated in order to use the optimum estimator. Prasad
and Graham (1994) used both these variables in the
selection of the matched sample (hence automatically in the
estimation) and showed empirically that their strategy fares
better than that of Chotai (1974). In addition to the gain in
efficiency, Prasad and Graham's (1994) strategy can be
used in practice, because it involves only one unknown
parameter, $\varphi$. It should be noted that Arnab (1991) first
introduced the principle of stratification using the $y_{1i}$'s, $i \in s_1$, as
a stratification variable. This should always be done in
practice whenever the necessary information is available,
particularly in the selection of large units with marked size
differences of the type considered in the numerical
examples in section 3. Arnab's (1991) strategy is expected
to be more efficient than the preceding strategies, since it
utilizes first occasion values for stratification in addition to
estimation. However, the optimal estimator $\hat Y_2$ contains
several unknown parameters (for details see Arnab 1991),
which may hinder the application of the strategy, especially
when the sample size is not large enough.


2. PROPOSED STRATEGIES 


Here two sampling strategies have been proposed which 
are modifications of strategies proposed by Prasad and 
Graham (1994) and Arnab (1991), respectively. 


2.1 Strategy 1 


The sampling scheme for this strategy is the same as that
considered by Prasad and Graham (1994) and described in
section 1.3.4. Here, only the estimator based on the
matched sample $s_{2m}$ has been modified by introducing the
original size measure into the estimation. The proposed
modified estimator and the composite estimator for $Y_2$
are as follows:


a= Dy (yo/P; )P,- 22 @, Ip P z = 

ES 4 i€s,, 

» Gia PS + BZ 

where =. z,_ =z, A,/D,, ¥3; =, A;/P,.1;) =1; 4, ee 
y>;~ Bz, and B isa suitably chosen constant to minimize 
variance of re SD: we and A, are as described in the 
section 1.3.4; 


OO O)1,, 
where re is given in (2). 

Denoting £,(V,) as unconditional expectation 
(variance) over selection of the sample s,, and E, (V,) the 
conditional expectation (variance) over S,, wien; S18 pec 
one gets the variance of je for a given "value of B, as 


Lig IB) EG IB) 2, £,(¥,,, 1B). 
Following Prasad and Graham (1994), we obtain
oy * 2 
BY VY 1B yRH koe CB) 


and 


VER PINE Ray Ss, 


where 


k,=N(n- m)l{nm(N- 1)}; 


o; (B) sie q, (7/4; i RY 


ieU 
efit at NOL pagit , 
='0; +)B* 09-2 Bla, 6, 6; 


R 2» Rio 4.0 = 3! GG...) 


= 3 q(z,/q; of Zs 
ieU 
(7) 
93> Dy 4; (Voi! 9; - 


ieU 


Y,)(2,/q; 74) 


95, k and o5, q, are as in (4) and (6), respectively. The 
optimum value of B that minimizes V ( ae IB) comes out 


as, opt Papi =00,/,0, 
Putting the acne value of B =, in the expression 
of V(Y,,, 1B), we get the optimum value of 


V(¥;,, 1B) = V(P;,, 1By) = KI - f) + (1-2) C/A] 0 


where $C^* = (1 - \delta^2)\,\phi$; $k$, $f$ and $\phi$ are defined in (4) and (6),
respectively.

The optimum variance of $\hat Y_2$ for a given value of $\lambda$ is
obtained by minimizing the variance of $\hat Y_2$ with respect to $\varphi$
when $\beta = \beta_0$, and is given by


V (Wo 1h) SE 


opt 


ARE TKO LM a. 


=[1/{k(1-f) + (1-2) C/A} + /{ RC - fir) } 11 5. 


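Finally, minimizing $V_{opt}(\hat Y_2 \mid \lambda)$ with respect to $\lambda$, the
optimum proportion of the matched sample and the minimum
variance of $\hat Y_2$ are obtained respectively as


opt $\lambda = \lambda_0 = \sqrt{C^*}/(1 + \sqrt{C^*})$


and


$$V_{\min}(\hat Y_2) = k\,(1 - f + \sqrt{C^*})\,\sigma_2^2/2 = M_1 \ \text{(say)} \qquad (8)$$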


Remark 2.1 


The estimator described in (1) is usable in practice
when the optimum value $\beta = \beta_0$ is known, or when a good
guess value of $\beta_0$ is available from some previous surveys.
If instead of the regression estimator Va described above, 
one uses the difference estimator ret = Se (V;/D; )P, - 
ay (z; /p,, P= Z| based on the matched sample, the 
expression for the minimum variance of y would be as 
follows: 


V._.(¥,) =k - f+ Vb) 65/2 = M, (say) 
with 


614773 278)G) c= Gf o3% 


2.1.1 Variance Estimation 


To get approximate unbiased estimators for $V_{\min}(\hat Y_2)$,
we first present the following theorems without proof:


Theorem 1 


Zr) cl hh Me res) {z On Al PIP Aa t Fan") 


ee 


in nN 


is an unbiased estimator of eeey when By is known, 


+{k/k} > (nite 


IES, 


k =(N-n)/{n(N - 1)} and k, =(n- m)/{m(n- 1)}. 
Theorem 2 

Vy Dies, F //p, 1=N-m)l{nm(N- 1)} [0% +o.- 2953] 
on be estimated unbiasedly by 

{(n- m)in(m -1)} ¥) lp; - DSF lp, YP, 


~ * = e Dae, C 5 
where 7, =7,A,/p,,7, =>; — 2;; 93, Op and 6,, are given in 
(4) and (7) respectively. 

From the Theorem 2 we note that 


a3 a - 


IES, 


8 UNO 
Ss 2 PP) Be 
eS, 


6,-d)> pales 


eS, 


LOND 
9 Pip) Dd; 


18S fo 


and 
830 ay [zl as a Piles) 


1S, 


eS, 


[2/2 


OB anPtp) Ps 


are unbiased estimators of ae o; and o 
where d=m(N-1)/{N(m -1)}. 


30° respectively 


Estimator for Kan 


(¥,12) 
Thus for a given value of $m$ (i.e., $\lambda$), we can suggest an
approximate unbiased estimator of $V_{opt}(\hat Y_2 \mid \lambda)$ as


WD) aC erg WV, 74 


Urs 


where V = 2 (2... IB,) and V, = an unbiased estimator of 
VPy,) = (NV WING - 1) Seq, P) (Yas!P,~ Poy? 
Estimator for V We ) 


min 
Putting suitable estimators for 


expression for V. 
estimator for V,. 


X,¢* and 35 in the 
Jara ae ), we get an approximate unbiased 
Hee y as, 


V.(Y,) =k - f+ (1 - A) O46, 


min 
where 


* 


C= (0 C= Ve Ts Veo) 


OREO) ESS SPP 
oe 853/(89 63)", G = 63/6;", 


$\hat\sigma_2^2 = \lambda\,\hat\sigma_2^2(m) + (1 - \lambda)\,\hat\sigma_2^2(u)$


6; (m) = an approximate unbiased estimator of 03 based 
on the matched sample s,, = ieee ys, A,/p;)P,/p;" - 

fr zs ie hs 65 ( u) = an approximate unbiased estimator 
of 3 based on the un-matched sample s, = u(N - 1)/ 
Nien loan, Fin ba part a)s: k and fare as in (4). 


Remark 2.2 


Ideally one should estimate $\sigma_2^2$ through the optimum
combination of $\hat\sigma_2^2(m)$ and $\hat\sigma_2^2(u)$, but in this case the
optimum combination will involve unknown parameters.
To avoid this complexity, the simpler estimator $\hat\sigma_2^2$ of $\sigma_2^2$
has been suggested above.


2.2 Strategy 2


The population is supposed to consist of $L$ strata with $N_h$
as the unknown size of the h-th stratum ($h = 1, \ldots, L$;
$\sum_h N_h = N$), stipulating that one can identify the stratum to
which a unit belongs as soon as its value is observed on the
first occasion. On the first occasion, the initial sample $s_1$ of
size $n$ was selected by the PPSWR method with normed size $p_i$
attached to the i-th unit. Let the $n_h$ units of $s_1$ falling in the
h-th stratum be denoted as $s_{1h}$. Let $y_{1i}(h)$, $y_{2i}(h)$ be
respectively the values of the variate under study for the i-th
unit of the h-th stratum on the first and second occasions,
and $z_i(h)$ be the corresponding size measure. On the
second occasion, independent samples $s_{2mh}$ of sizes
$m_h = m n_h/n$ (assumed an integer for every $h$), keeping
$\sum_h m_h = m$ fixed, are selected by the RHC sampling
scheme with normed size $q_{hi} = [y_{1i}(h)/z_i(h)] /
\sum_{i \in s_{1h}} [y_{1i}(h)/z_i(h)]$ for the i-th unit of the h-th stratum. The
unmatched sample $s_{2u}$ was selected from the entire
population by the RHC method with normed size measure $p_i$
for the i-th unit, as in Strategy 1. The proposed estimators
for $Y_2$, based on the matched sample $s_{2m}$ and the
unmatched sample $s_{2u}$, are respectively as follows:

Yom a » da Yom 


Ci) ee Ope: (9) 


where 


om b= De r(h) Dn! (yp Pi Ti oS of (h)/ 


Sih Sth 


(M14 Pry» Wy = My! Pry = 2 (A)/Z, 


r(h) =a, (lt) = Chih); 


$Q_{hi}$ = sum of the $q_{hi}$ for the group containing the i-th unit of the
h-th stratum, formed for selection of the matched
sample $s_{2mh}$ by the RHC method. The $c_h$'s are constants chosen to
minimize the variance of $\hat Y_{2m}$. Following Arnab (1991),
the expression for the variance of $\hat Y_{2m}$ is obtained as:

N, 


VL, = Seas De Inj Taj! Wry Rie) a o5/n 
h j=l 


where k, =(n-m)/n, Vn = Vy (h)ly,(h),¥,(h) = i : 
Vy (h), N, = population size of the A-th stratum, 
P(h)= Z, 1Z, 2 As "1 2,(h). 
The ahaa value of c, that minimizes Va, ) and the 
corresponding value of V (Y,,,) comes out respectively as 
Ny 
opt c, =¢,(0) = 4, = De Inj pj Bui! (G49 Op, 3) 
Jel 


and [1 +(n-m)®@/m]o;,/n, where 


0, = Yy(h)/ qj, Bin ch)s Bri = Ziyi! Vy = Lys 
N, N, N, 
ae) g 
O13 =, Inj nj» Ono = i» Inj Buy» ¥,(A) = > y (A) 
j=l j=l i 


and @ =).,(1 - 8) oy3/{ P, 05}. 
The proposed composite estimator for $Y_2$, the optimum
proportion of matched sample and the expression for the
minimum variance of the composite estimator $\hat Y_2$ are given
respectively by


V, = oY, gale: -@) oy 
opt d = A, = [6 - (1-f) V0 vf*1/L0 + fv0 Vf* - 1] 


Vsin( Yq) = KC /py —f) 05/1 + (Ag/ My )Vf*/ V0] 
= M, (say) 


where $\hat Y_{2m}$ and $\hat Y_{2u}$ are given in (9), $f^* = N/(N-1)$,
$\mu_0 = 1 - \lambda_0$; $k$, $f$ and $\sigma_2^2$ are given in (4).


3. EFFICIENCIES OF THE PROPOSED 
STRATEGIES 


The proposed Strategy 1 is more efficient than the
strategy proposed by Prasad and Graham (1994) in the
sense of yielding a smaller minimum variance, as $C^* \le \phi$.
The efficiency of Strategy 1 increases as $\delta$, the correlation
between $y_{2i}/q_i$ and $z_i/q_i$, increases. The efficiency of
Strategy 1 and of Prasad and Graham's (1994) strategy
increases as $\phi$ decreases. The value of $\phi = \sigma_3^2/\sigma_2^2$ depends
on the magnitudes of $\sigma_3^2$ and $\sigma_2^2$; $\sigma_3^2$ will be smaller
(greater) than $\sigma_2^2$ if the proportionality of $y_{2i}$ on $y_{1i}$ is
higher (lower) than that of $y_{2i}$ on $z_i$. Obviously, Strategy
1 can be used in practice when a good guess value of $\beta$ is
available from past surveys. If the difference estimator
mentioned in Remark 2.1 is used in Strategy 1 instead of the
regression estimator, then the proposed Strategy 1
fares better than that of Prasad and Graham (1994)
whenever 6>1'26,/06,. Strategy 1 fares better or worse 
than Chotai's (1974) strategy according to $C^* = (1 - \delta^2)\,\phi <$
or $> (1 - \delta^{*2})$. Here, $\delta^*$ may be regarded as a correlation
coefficient between $y_{1i}/p_i$ and $y_{2i}/p_i$. In particular, if the $z_i$'s
are constant, then $\delta^*$ becomes the simple correlation
coefficient between the $y_{1i}$'s and $y_{2i}$'s. The expression for the
minimum variance $M_2$ for Strategy 2 is complex and does
not yield any simple comparison with the other strategies
described here. However, we note that the efficiency of
Strategy 2 increases as the stratum correlation $\delta_h$
increases. The following numerical examples based on live
data reveal that the proposed Strategy 2 fares better than
Strategy 1 and also the alternatives proposed by Prasad and
Graham (1994) and Chotai (1974).

For numerical comparisons, three data sets are
considered. One of them (called Population 1) was
considered by Prasad and Graham (1994) and relates to
the area under wheat in 1937 ($y_2$) and 1936 ($y_1$) and the
cultivated area ($z$) for a set of 34 villages in India,
compiled by Sukhatme and Sukhatme (1970).
Population 1 is stratified into two strata according to whether the
area under wheat in 1936 is less than or more than 200 acres.
Parameters for this population are: $N = 34$, $N_1 = 20$,
N,-= 14,6" = 1635, 6 =".3038, 6 =- 3511 O24 Gai ne 
Population 2 comprises the production of cereals in South
America for the years 1980 ($z$), 1988 ($y_1$) and 1989 ($y_2$),
compiled from the Statistical Yearbook, United Nations
(1988/89). The population is stratified into two strata
according to whether 1988 production is more or less than 570
(thousand metric tons). The parameters for Population 2
are: $N = 19$, $N_1 = 7$, $N_2 = 12$, $\delta^* = -.6939$,
$\delta = .7666$, $\phi = 1.1478$, $\theta = .3681$. Population 3,
compiled by Singh and Chaudhuri (1986), relates to the area
under wheat in hectares during 1979-80 ($y_2$) and 1978-79
($y_1$) and the total cultivated area in 1978-79 ($z$) of 16 villages
of Meerut District. The parameters for Population 3 are:
$N = 16$, $N_1 = 9$, $N_2 = 7$, $\delta^* = .7729$, $\delta = .1057$, $\phi = .3965$,
$\theta = .2827$.

The following table shows the relative efficiencies of the
proposed Strategies 1 and 2 and the one proposed by Prasad
and Graham (1994) with respect to Chotai (1974), which are
respectively denoted by $E_1 = V_C/M_1$, $E_2 = V_C/M_2$ and
$E_3 = V_C/V_{PG}$.


Table 1
Efficiencies of the Strategies

          Population 1              Population 2              Population 3
 f      E_1     E_2     E_3       E_1     E_2     E_3       E_1     E_2     E_3
.05   1.0463  1.1033  1.0181    1.0196  1.0850   .8262    1.0053  1.0864  1.0030
.10   1.0479  1.0895  1.0187    1.0202  1.0711   .8212    1.0055  1.0711  1.0031
.15   1.0496  1.0776  1.0194    1.0209  1.0579   .8172    1.0057  1.0577  1.0033
.20   1.0514  1.0683  1.0200    1.0216  1.0519   .8123    1.0058  1.0469  1.0034
.25   1.0533  1.0622  1.0208    1.0224  1.0490   .8071    1.0061  1.0396  1.0035
.30   1.0554  1.0604  1.0216    1.0232  1.0530   .8017    1.0063  1.0368  1.0036
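
The efficiencies $E_1$ and $E_3$ depend only on $f$, $\delta^*$, $\delta$ and $\phi$, because the factor $k\,\sigma_2^2/2$ is common to (3), (5) and (8) and cancels in the ratios. The short Python sketch below recomputes them from the Population 2 parameters quoted in the text; the function names are illustrative assumptions, and $E_2$ is left out because the Strategy 2 variance needs stratum-level quantities that are not listed here.

```python
from math import sqrt


def bracket(f, c):
    """Bracketed term 1 - f + sqrt(c) shared by expressions (3), (5) and (8)."""
    return 1.0 - f + sqrt(c)


def efficiencies(f, delta_star, delta, phi):
    """Return (E1, E3): Strategy 1 and Prasad-Graham relative to Chotai."""
    chotai = bracket(f, 1.0 - delta_star ** 2)           # from (3)
    prasad_graham = bracket(f, phi)                      # from (5)
    strategy_1 = bracket(f, (1.0 - delta ** 2) * phi)    # from (8), C* = (1 - delta^2) phi
    return chotai / strategy_1, chotai / prasad_graham


# Population 2 parameters as quoted in the text, at f = .05
e1, e3 = efficiencies(f=0.05, delta_star=-0.6939, delta=0.7666, phi=1.1478)
print(round(e1, 4), round(e3, 4))
```

Under these inputs the two values agree with the $f = .05$ entries for Population 2 in Table 1 to the four decimals shown there.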


From the above table, we note that in all three
populations, Strategy 2 fares better than the others. It is also
worth noting that both the proposed strategies fare better
than those of Chotai (1974) and Prasad and Graham (1994).
For Population 1, $\phi = .3811$, which is quite favourable for
Prasad and Graham's (1994) strategy, and hence for the
proposed Strategy 1. Both Prasad and Graham's strategy
and Strategy 1 performed better than Chotai's (1974)
strategy. For Population 2, $\phi = 1.1478$, which is high
and unfavourable for Prasad and Graham's (1994) strategy,
but $\delta = .7666$ is quite favourable to Strategy 1. Hence, for
Population 2, Prasad and Graham's strategy becomes
less efficient than that of Chotai (1974), but the proposed
Strategy 1 remains better. For Population 3, $\phi = .3965$,
which is quite favourable for Prasad and Graham (1994), but
at the same time $\delta^* = .7729$, and this ($\delta^*$) favours Chotai
(1974). In fact Chotai's (1974) strategy is marginally
inferior to Prasad and Graham's (1994) strategy, but the
proposed Strategy 2 remains better than both. It should be
noted that the examples shown here are quite unusual in the
sense that they present low correlation between $y_2$ and $z$
(in example 1, $\delta = .3638$ and in example 3, $\delta = .1057$) and
there is a negative correlation between $y_1$ and $y_2$
($\delta^* = -.6939$) in example 2. The correlations $\delta$ and $\delta^*$
are expected to be high and positive. Hence, further
investigation is needed to compare the performances of the
present strategies with suitable data.


Table 2
Sensitivity of Efficiency $E^* = V_{PG}/M_\beta$

 |ν|   f = .05    .10     .15     .20     .25     .30
Population 1
  0      1.028   1.029   1.030   1.031   1.032   1.033
 .2      1.027   1.027   1.028   1.029   1.031   1.032
 .4      1.023   1.024   1.027   1.026   1.027   1.028
 .6      1.017   1.108   1.019   1.019   1.020   1.021
 .8      1.010   1.010   1.010   1.011   1.011   1.011
1.0      1.000   1.000   1.000   1.000   1.000   1.000
1.2       .989    .988    .988    .988    .988    .987
1.4       .976    .976    .975    .974    .973    .972
Population 2
  0      1.234   1.241   1.249   1.257   1.266   1.278
 .2      1.219   1.227   1.233   1.241   1.249   1.258
 .4      1.180   1.186   1.191   1.197   1.204   1.211
 .6      is)     1.128   1.133   1.137   1.141   1.146
 .8      1.063   1.065   1.067   1.068   1.070   1.073
1.0      1.000   1.000   1.000   1.000   1.000   1.000
1.2       .939    .938    .936   SBS      .933    .931
1.4       .883    .880    .877    .875    .871    .869
Population 3
  0      1.002   1.002   1.004   1.003   1.003   1.003
 .2      1.002   1.002   1.002   1.002   1.003   1.002
 .4      1.002   1.002   1.002   1.002   1.002   1.002
 .6      1.001   1.002   1.002   1.002   1.002   1.001
 .8      1.001   1.001   1.001   1.001   1.001   1.001
1.0      1.000   1.000   1.000   1.000   1.000   1.000
1.2       .999   Ley)     .999    .999    .999    .999
1.4       .998   29977)   .998    .998    .998    .998


To study the effect of departure from the optimum value
$\beta = \beta_0$ when some guess value of $\beta$ is used in Strategy 1,
one may consider the sensitivity of the efficiency of $\hat Y_2$ for
Strategy 1 for different choices of $\beta$, following Prasad and
Srivenkataramana (1980). The minimum variance of $\hat Y_2$ for
Strategy 1 when some guess value $\beta$ is used in place of $\beta_0$
is

$$V_{\min}(\hat Y_2 \mid \beta) = k\,[1 - f + \sqrt{C^{**}}\,]\,\sigma_2^2/2 = M_\beta \qquad (9)$$

where $C^{**} = [1 - (1 - \nu^2)\,\delta^2]\,\phi$ and $\nu = 1 - \beta/\beta_0$.

From (9), we note that the proposed Strategy 1 with
the guess value $\beta$ fares better or worse than Prasad and
Graham's (1994) strategy according to $|\nu| < 1$ or $|\nu| > 1$.
Similarly, the proposed Strategy 1 with the guess value $\beta$ performs
better or worse than Chotai's (1974) strategy according to $\nu$
$>$ or $< (1 - 1/\delta^2)(1 - 1/\phi)$. Table 2 presents the sensitivity $E^*$
of the estimator compared to Prasad and Graham's
(1994) strategy, where $E^* = V_{PG}/M_\beta$. From Table 2, the
loss with $\nu > 1$ is likely to be more than the gain with $\nu < 1$
for Population 1 and Population 3, but the situation is
reversed for Population 2.
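
As a rough check on how Table 2 is built, the ratio $E^* = V_{PG}/M_\beta$ can be computed directly from (5) and (9), since the common factor $k\,\sigma_2^2/2$ cancels. The sketch below is illustrative only: the function name `sensitivity` is an assumption, and the inputs are the Population 2 values of $\delta$ and $\phi$ quoted in the text.

```python
from math import sqrt


def sensitivity(f, nu, delta, phi):
    """E* = V_PG / M_beta for a guess beta, with nu = 1 - beta/beta_0."""
    c_guess = (1.0 - (1.0 - nu ** 2) * delta ** 2) * phi   # C** in (9)
    return (1.0 - f + sqrt(phi)) / (1.0 - f + sqrt(c_guess))


# Population 2, f = .05: exact beta (nu = 0) versus poorer guesses
for nu in (0.0, 1.0, 1.4):
    print(nu, round(sensitivity(0.05, nu, delta=0.7666, phi=1.1478), 3))
```

At $\nu = 1$ the guess carries no information and the ratio equals one, which matches the row of 1.000 entries in every panel of the table.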


CONCLUSION 


In sampling over two occasions, one should utilize data 
collected on the first occasion to get an efficient estimator 
for the population total on the second occasion. Chotai 
(1974) used data collected on the first occasion at the stage 
of estimation, while Prasad and Graham did so at the stage 
of selection (and hence estimation) of the matched sample. 
In this article, two strategies have been proposed. The first
one utilizes data collected on the first occasion both for the
selection of the matched sample, as in Prasad and
Graham (1994), and for the formation of a regression estimator, as
in Chotai (1974). Together these make Strategy 1 more
efficient than that of Prasad and Graham. The proposed
Strategy 2 utilizes first occasion values as a stratification
variable and as a measure of size for the selection of the matched
sample for the second occasion, together with a
regression type estimator involving the auxiliary variable ($z$)
available on the first occasion. Intuitively one should
expect the proposed Strategy 2 to perform better than the 
others mentioned here, but no theoretical result was 
established due to the complexity of the expression for the 
minimum variance of the proposed estimator. However, 
superiority of the Strategy 2 was established through 
numerical data. 
Confidence Intervals for Proportions With Small Expected Number 
of Positive Counts Estimated From Survey Data 


EDWARD L. KORN and BARRY I. GRAUBARD


ABSTRACT 


In the nonsurvey setting, “exact” confidence intervals for proportions calculated using the binomial distribution are 
frequently used instead of intervals based on approximate normality when the number of positive counts is small. With 
complex survey data, the binomial intervals are not applicable, so intervals based on the assumed approximate normality 
of the sample-weighted proportion are used, even if the number of positive counts is small. We propose a simple 
modification of the binomial intervals to be used in this situation. Limited simulations are presented that show the coverage 
probability of the proposed intervals is superior to that of the normality-based intervals, logit-transform intervals, and 
intervals based on a Poisson approximation. Applications are given involving the prevalence of Human Immunodeficiency 
Virus (HIV) based on data from the third National Health and Nutrition Examination Survey, and the proportion of users 
of cocaine based on data from the Hispanic Health and Nutrition Examination Survey. 


KEY WORDS: Binomial confidence interval; Exact confidence interval; Logit transformation; Poisson confidence interval.


1. INTRODUCTION 


With complex survey data, the typical construction of a
$1 - \alpha$ level confidence interval for a proportion of positive
counts for a 0-1 variable is


$$\hat p \pm t_d(1 - \alpha/2)\,[\mathrm{var}(\hat p)]^{1/2} \qquad (1.1)$$

where $\hat p$ is the sample-weighted estimator of the proportion,
var($\hat p$) is the variance estimator of $\hat p$, and $t_d(1 - \alpha/2)$ is
the $1 - \alpha/2$ quantile of a $t$ distribution with $d$ degrees of
freedom. The estimator var($\hat p$) is computed using linearization
or a replication method to reflect the sample design,
including the fact that $\hat p$ is a sample-weighted estimator.
By complex survey data, we mean data obtained from a
multistage design with stratified selection of clusters at the
first stage. For such a sample design, $d$ is usually taken to
be equal to the number of sampled clusters minus the
number of strata (Korn and Graubard 1990). The
confidence interval (1.1), which we shall refer to as the
“linear interval”, is based on the assumption that $\hat p$ is
approximately normally distributed. Under various
reasonable asymptotics, this is known to be true (Krewski
and Rao 1981). The use of the $t$ quantile rather than a
normal-distribution quantile in (1.1) is based on empirical
evidence (Frankel 1971, ch. 7), and it can also be formally
justified using strong assumptions (Korn and Graubard
1990).

When the expected number of positive counts is small,
the approximate normality of $\hat p$ breaks down (Cochran
1977, p. 58). For a simple random sample (or in the
nonsurvey setting), one can avoid the normality assumption
by using the Clopper and Pearson (1934) confidence
interval based on the binomial distribution; see Vollset
(1993) for a complete discussion of confidence intervals for
proportions in the nonsurvey setting. When $x$ positive
responses are seen in a simple random sample of size $n$, the
Clopper-Pearson $1 - \alpha$ level confidence interval
$(p_L(x, n), p_U(x, n))$ can be expressed as (Johnson, Kotz
and Kemp 1993, p. 130):


$$p_L(x, n) = \frac{v_1 F_{v_1, v_2}(\alpha/2)}{v_2 + v_1 F_{v_1, v_2}(\alpha/2)}, \qquad
p_U(x, n) = \frac{v_3 F_{v_3, v_4}(1 - \alpha/2)}{v_4 + v_3 F_{v_3, v_4}(1 - \alpha/2)} \qquad (1.2)$$


where $v_1 = 2x$, $v_2 = 2(n - x + 1)$, $v_3 = 2(x + 1)$, $v_4 = 2(n - x)$,
and $F_{d_1, d_2}(\beta)$ is the $\beta$ quantile of an $F$ distribution with $d_1$
and $d_2$ degrees of freedom. For one-sided confidence
bounds, $\alpha$ is used instead of $\alpha/2$ in the above expressions.
For a simple random sample, these intervals are known to
have coverage probability greater than or equal to their
nominal level, regardless of the expected number of
positive counts. They are sometimes referred to as “exact”
confidence intervals; we shall refer to them as the “binomial
intervals”.
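
For integer counts, the binomial interval can also be obtained without F quantiles by solving the two tail-probability equations that define it. The following Python sketch does this by bisection using only the standard library; it is an illustration under that assumption, not the computation used in the paper, and the function names are hypothetical. It does not cover the non-integer arguments that arise later with survey data, for which the F-quantile form (1.2) or an equivalent beta-quantile routine would be needed.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k + 1))


def _solve(func, target, increasing):
    """Bisection for func(p) = target on [0, 1]; func must be monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided 1 - alpha binomial interval for an integer count x out of n."""
    # lower limit solves P(X >= x | p) = alpha/2, which increases with p
    lower = 0.0 if x == 0 else _solve(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0, True)
    # upper limit solves P(X <= x | p) = alpha/2, which decreases with p
    upper = 1.0 if x == n else _solve(
        lambda p: binom_cdf(x, n, p), alpha / 2.0, False)
    return lower, upper
```

With zero positive counts the upper limit has the closed form $1 - (\alpha/2)^{1/n}$, which gives a quick way to check an implementation.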

In this paper we suggest a simple modification to the 
binomial intervals to make them applicable for a proportion 
estimated from complex survey data. We are especially 
interested in the situation when the expected number of 
positive counts is small. Many survey analysts would not 
present estimated proportions in this situation, since they 
are unreliable. For example, applying the relative-standard- 
error criterion for presenting proportions in the 1996 
National Household Survey on Drug Abuse (SAMHSA 
1998), the estimated proportion of women using cocaine in 
Table 7 would not be presented. We believe such 
proportions can provide valuable information, but that their 
lack of precision needs to be explicitly stated by presenting 
confidence intervals. In section 2, we define our proposed 
confidence intervals and define intervals based on a logit 
transformation and the Poisson distribution that have been 
suggested in the literature. Simulation results are presented 
in section 3 that compare the intervals. We find that the 
proposed intervals behave well in terms of coverage 
probability of the true proportion and in terms of their 
average width. Two applications are given in section 4 
involving large surveys, but where the number of positive 
counts is expected to be small. We end with a discussion of 
some related work that constructs confidence intervals that 
are guaranteed to attain their nominal coverage probability 
regardless of the population configuration of counts. 


2. PROPOSED AND OTHER CONFIDENCE 
LIMITS 


For a $1 - \alpha$ level confidence interval based on a sample
of size $n$, first define the effective sample size by


$$n^* = \frac{\hat p\,(1 - \hat p)}{\mathrm{var}(\hat p)} \qquad (2.1)$$


and the degrees-of-freedom adjusted effective sample size
by

$$n^*_{df} = \frac{\hat p\,(1 - \hat p)}{\mathrm{var}(\hat p)} \left[ \frac{t_{n-1}(1 - \alpha/2)}{t_d(1 - \alpha/2)} \right]^2 \qquad (2.2)$$


Both $n^*$ and $n^*_{df}$ are set equal to $n$ when $\hat p = 0$. The
proposed limits substitute $n^*_{df}$ for $n$, and $\hat p\,n^*_{df}$ for $x$ in (1.2),
viz. $p_L(\hat p\,n^*_{df}, n^*_{df})$ and $p_U(\hat p\,n^*_{df}, n^*_{df})$. (When $n$ is large, the
$1 - \alpha/2$ quantile of a normal distribution can be used in
place of $t_{n-1}(1 - \alpha/2)$ in (2.2).) For estimating a confidence
interval for a proportion on a subdomain of the
population, the sample size $n$ is taken to be equal to the
sample size restricted to the subdomain.
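
The two sample-size adjustments in (2.1) and (2.2) are simple enough to sketch in code. In the Python example below the function name and argument names are illustrative assumptions; the two $t$ quantiles are passed in by the caller (for example from statistical tables), so that the sketch does not depend on any particular statistics library.

```python
def effective_sample_sizes(p_hat, var_p_hat, n, t_n_minus_1, t_d):
    """Return (n_star, n_star_df) following (2.1) and (2.2).

    t_n_minus_1 and t_d are the 1 - alpha/2 quantiles of t distributions
    with n - 1 and d degrees of freedom, supplied by the caller.
    """
    if p_hat == 0:
        # convention stated in the text: both sizes are set to n
        return float(n), float(n)
    n_star = p_hat * (1.0 - p_hat) / var_p_hat        # (2.1)
    n_star_df = n_star * (t_n_minus_1 / t_d) ** 2     # (2.2)
    return n_star, n_star_df


# Hypothetical inputs: n = 1000, design effect 2, d = 16 degrees of freedom
p_hat, n = 0.02, 1000
var_p_hat = 2.0 * p_hat * (1.0 - p_hat) / n
print(effective_sample_sizes(p_hat, var_p_hat, n, t_n_minus_1=1.962, t_d=2.120))
```

In this made-up case the effective sample size is half the nominal one, and the degrees-of-freedom factor shrinks it further because the variance estimator rests on only 16 degrees of freedom.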

A heuristic justification for this procedure is as follows.
The effective sample size (2.1) is $n$ divided by an estimator
of the design effect of the survey. This seems to be a
reasonable way to incorporate the additional variability of $\hat p$
due to the complex sampling. For confidence interval
construction, the variability of the variance estimator is also
important. The second fraction in (2.2) takes into account
the fact that var($\hat p$) will typically be more variable than a
variance estimator that would be used for simple random
sampling. If $d$ is large, then this factor is close to one and
unneeded. For small $d$ and large $n$ and $\hat p\,n$, we would like
the proposed interval to be close to the interval (1.1), which
is appropriate in this situation. Using the fact that
$F_{u,w}(\beta) \approx 1 + z(\beta)\sqrt{2(1/u + 1/w)}$ for large $u$ and $w$ (Johnson
and Kotz 1970, p. 81), this is true, i.e., $\hat p - p_L(\hat p\,n^*_{df}, n^*_{df}) \approx
p_U(\hat p\,n^*_{df}, n^*_{df}) - \hat p \approx t_d(1 - \alpha/2)\{\mathrm{var}(\hat p)\}^{1/2}$.

A procedure closely related to the proposed procedure
was developed by Breeze (1990) for use in the U.K.
General Household Survey. This procedure is based on the
simple-random-sampling $1 - \alpha$ confidence interval
$(po_L(x), po_U(x))$ for a Poisson random variable $x$, which
can be expressed as (Johnson et al. 1993, p. 171):


$$po_L(x) = 0.5\,\chi^2_{v_1}(\alpha/2) \quad \text{and} \quad po_U(x) = 0.5\,\chi^2_{v_2}(1 - \alpha/2)$$


where $v_1 = 2x$, $v_2 = 2(x + 1)$, and $\chi^2_v(\beta)$ is the $\beta$ quantile of
a $\chi^2$ distribution with $v$ degrees of freedom. With complex
survey data, the confidence interval is taken to be
$(po_L(\hat p\,n^*)/n^*,\ po_U(\hat p\,n^*)/n^*)$.

A third procedure for confidence interval construction is
based on a logit transform. For a $1 - \alpha$ level confidence
interval, the interval is


$$\left( \frac{1}{1 + \exp(-\mathrm{LLOGIT})},\ \frac{1}{1 + \exp(-\mathrm{ULOGIT})} \right)$$

where


$$\mathrm{LLOGIT} = \log\frac{\hat p}{1 - \hat p} - t_d(1 - \alpha/2)\,\frac{[\mathrm{var}(\hat p)]^{1/2}}{\hat p\,(1 - \hat p)} \qquad (2.3)$$

and

$$\mathrm{ULOGIT} = \log\frac{\hat p}{1 - \hat p} + t_d(1 - \alpha/2)\,\frac{[\mathrm{var}(\hat p)]^{1/2}}{\hat p\,(1 - \hat p)} \qquad (2.4)$$


These intervals, with a normal-distribution quantile instead
of a $t$ distribution quantile, were suggested for use with the
1996 National Household Survey on Drug Abuse
(SAMHSA 1998). When $\hat p = 0$, in the nonsurvey setting
one might add a small constant to the observed number of
events and nonevents, e.g., 1/2, to be able to calculate the
logit-transform confidence interval (Agresti 1990, pp. 249-
250). In the present setting, when $\hat p = 0$, we set the
confidence interval equal to the binomial interval
$(p_L(0, n), p_U(0, n))$.
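
The logit-transform interval of (2.3) and (2.4) can be sketched as follows. This Python example is illustrative: the function name is an assumption, the $t$ quantile is supplied by the caller, and the case $\hat p = 0$ (handled in the text by falling back to the binomial interval) is simply rejected here rather than implemented.

```python
from math import exp, log, sqrt


def logit_interval(p_hat, var_p_hat, t_quantile):
    """Logit-transform interval from (2.3) and (2.4); needs 0 < p_hat < 1."""
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    centre = log(p_hat / (1.0 - p_hat))
    # standard error of the logit by the delta method: sqrt(var) / (p (1 - p))
    half_width = t_quantile * sqrt(var_p_hat) / (p_hat * (1.0 - p_hat))

    def expit(v):
        return 1.0 / (1.0 + exp(-v))

    return expit(centre - half_width), expit(centre + half_width)


# Hypothetical inputs: p_hat = .02 with variance 3.92e-5 and t_16(.975) = 2.120
print(logit_interval(0.02, 3.92e-5, 2.120))
```

Because the back-transformation is nonlinear, the resulting interval always stays inside (0, 1) and, for small $\hat p$, extends further above the estimate than below it.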

In applications where it is known before sampling that
the (true) design effect will be greater than 1, various
modifications of the above procedures are possible. For our
proposal, we recommend in this situation truncating the
degrees-of-freedom adjusted effective sample size at $n$.
That is, if $n^*_{df}$ is greater than $n$, we set its value to $n$, and
define the lower and upper confidence limits to be
$p_L(\hat p\,n, n)$ and $p_U(\hat p\,n, n)$. For the Breeze intervals, one
could set $n^*$ to be $n$ if $n^* > n$. For the linear or logit
intervals, one can use the simple-random-sampling variance
estimator $\hat p\,(1 - \hat p)/n$ in place of var($\hat p$) in (1.1), (2.3) and
(2.4) if $n^* > n$; see SAMHSA (1998) for additional
truncation suggestions. The justification of these truncation
procedures is that the design effect may be estimated to be
less than one because of instability of the variance estimator
var($\hat p$). This type of instability may be especially large
because $\hat p$ is small (SAMHSA 1998). The effect of these
truncation procedures is to make the confidence intervals
wider and more conservative. In theory, one could also
adjust the estimated effective sample sizes when it is known
before the sampling that the (true) design effect is less than
one. However, to be conservative, we do not recommend
doing this.

Our focus in this paper is on confidence intervals for the
“superpopulation” probability that the outcome $Y = 1$ rather
than the finite-population proportion. That is, the target
parameter is $p = \sum_{i=1}^{N} p_i/N$ rather than $P = \sum_{i=1}^{N} Y_i/N$,
where $Y_i$ has a Bernoulli distribution with parameter $p_i$,
and $N$ is the population size. The simulated coverage
probabilities given in the next section therefore refer to
coverage of $p$. With this target parameter in mind, we do
not use finite-population correction factors when estimating
var($\hat p$) for use in (2.2); additional adjustments to the
design-based variance var($\hat p$) for superpopulation inference
are not pursued here (Korn and Graubard 1998). A referee
suggests the possibility of a model-based approach to
estimating a confidence interval for $p$. However, in our
limited experience, such approaches yield estimators similar
to weighted estimators and offer no advantages for
inference (Pfeffermann and LaVange 1989; Graubard and
Korn 1996).

If one were interested in a confidence interval for $P$, we
would recommend using the proposed intervals but with
var($\hat{p}$) in (2.2) containing the finite-population correction
factors. A confidence interval for $\sum_{i=1}^{N} Y_i$ could be
obtained by multiplying the ends of the confidence interval
for $P$ by $N$, if known, or by an estimator $\hat{N}$ of $N$, if not
known. (In theory, one could account for the variability of $\hat{N}$,
but this additional variability will be small.) An alternative
approach for estimating a confidence interval for P would 
be to modify the usual limits (Guenther 1983) appropriate 
for a simple random sample (based on the hypergeometric 
distribution) similarly to the way the proposed intervals 
modify the binomial intervals. 


3. SIMULATIONS 


The main simulation results are presented in Tables 1-5.
Table 1 presents the results of simulations in which datasets
of 32 clusters, each with sample size 100, were simulated.
Within cluster $i$, the number of positive events was
simulated with a binomial distribution with probability
parameter $p_i$. In Table 1, we refer to the $\{p_i, i = 1, \ldots, 32\}$
as the cluster probabilities. For the top third of the table, the
cluster probabilities are taken to be the constant $p$ = .1, .02,
.01, or .0025, corresponding to an expected number of
positive events equal to 320, 64, 32, or 8 out of the sample
size of 3200. For the middle third of the table, the cluster
probabilities are taken to be $p/2$ with probability 1/2 or $3p/2$
with probability 1/2, with $p$ as in the first third of the table.
Varying the $p_i$ across the clusters induces an intracluster
correlation among the observations. For the middle third of
the table, these correlations (ignoring the sample weights)
are .0278, .0051, .0025 and .0006 corresponding to the
expected number of positive events being 320, 64, 32, or 8,
respectively. For the bottom third of the table, the cluster
probabilities are taken to be $p/2$ with probability 3/4 or
$5p/2$ with probability 1/4, corresponding to intraclass
correlations of .0833, .0153, .0076 and .0019. For all
simulations in Table 1, sample weights of 1 or 10 are
randomly assigned with probability 1/2 to each observation
(noninformative weights).


Table 1
Simulated Lack of Coverage (Percent) of Upper and Lower One-sided 95% Confidence Bounds for Sample Design of
32 Clusters and 100 Observations Per Cluster; Sample Weights are 1 or 10 with Probability 1/2
(Noninformative Sample Weights)

Distribution of   Overall     Expected  Method of calculating confidence bounds
cluster           proportion  number    Linear          Logit           Breeze          Proposed
proportions(a)                positive  Lower   Upper   Lower   Upper   Lower   Upper   Lower   Upper

(1)
.1                .1          320       4.6     Soo)    5.3     4.6     4.5     4.1     4.8     4.4
.02               .02         64        3.4     Tal     De      4.6     4.5     4.7     4.2     4.4
.01               .01         32        As)     8.0     5.4     45      4.4     4.5     4.0     4.1
.0025             .0025       8         1.6     9.5     a5      1.8     3.6     BD      33      1.8
(1/2, 1/2)
.05, .15          .1          320       4.3     5.8     5)      4.3     4.3     3.8     4.7     4.1
.01, .03          .02         64        Sal     ES      5.2     4.8     4.3     4.8     4.0     4.5
.005, .015        .01         32        27      8.6     2)      4.7     4.1     4.9     Bhd)    4.4
.00125, .00375    .0025       8         Ile)    DS      5.4     2.0     3.4     2       3.1     2.0
(3/4, 1/4)
.05, .25          .1          320       3)      7.8     4.7     5.6     3.4     5.0     3.6     56
.01, .05          .02         64        2.7     8.6     2)      5.3     4.0     5.4     3c      5.0
.005, .025        .01         32        22      9.8     5.0     38      Sell    5.5     33      5.0
.00125, .00625    .0025       8         1S      10.7    See     22      333     Des)    3.0     Dee

(a) Fractions in parentheses are the probabilities that the cluster proportions have the stated value.


Table 2
Simulated Lack of Coverage (Percent) of Upper and Lower One-sided 95% Confidence Bounds for Sample Design of
32 Clusters and 100 Observations Per Cluster; Informative Sample Weights are 1 or 10 (See Text)

Distribution of   Overall     Expected  Method of calculating confidence bounds
cluster weighted  weighted    number    Linear          Logit           Breeze          Proposed
proportions(a)    proportion  positive  Lower   Upper   Lower   Upper   Lower   Upper   Lower   Upper

(1)
.1                .1          191.0     4.3     Sy      5.1     4.9     4.2     4.4     4.6     4.6
.02               .02         36.9      33      13      33      4.3     4.4     4.4     4.1     4.1
.01               .01         18.4      2.8     8.7     38)     4.0     4.3     4.3     SH)     3o7)
.0025             .0025       4.6       1.3     18.7    6.1     4.8     3.2     4.8     2.8     4.8
(1/2, 1/2)
.05, .15          .1          191.0     5.0     5.0     6.4     Sh      ays     Si)     5.4     3.4
.01, .03          .02         36.9      3.0     79      5.4     4.5     4.3     4.6     4.0     4.3
.005, .015        .01         18.4      25      Oe)     5.4     4.2     4.1     4.4     Shell   38
.00125, .00375    .0025       4.6       ies}    19.0    6.1     4.9     32      4.9     2.8     4.9
(3/4, 1/4)
.05, .25          .1          191.0     4.7     5.7     Teal    4.1     ll      3.6     S)      3.8
.01, .05          .02         36.9      2.6     8.9     2       Se      4.0     53      357     4.9
.005, .025        .01         18.4      3       10.1    5)53)   4.8     3.8     Sul     3.4     4.5
.00125, .00625    .0025       4.6       12)     19.8    5.9     33)     32      o:3     2.8     5.3

(a) Fractions in parentheses are the probabilities that the cluster weighted proportions have the stated value.


Table 3
Simulated Lack of Coverage (Percent) of Upper and Lower One-sided 95% Confidence Bounds for Sample Design of
32 Clusters and 100 Observations Per Cluster; Unweighted Analyses

Distribution of   Overall     Expected  Method of calculating confidence bounds
cluster           proportion  number    Linear          Logit           Breeze          Proposed
proportions(a)                positive  Lower   Upper   Lower   Upper   Lower   Upper   Lower   Upper

(1)
.1                .1          320       5.0     49      a,      4.2     4.9     3.8     22      4.1
.02               .02         64        3.8     6.3     a?      4.5     4.7     4.8     4.4     4.4
.01               .01         32        3.5     6.8     5.6     4.4     4.7     4.4     4.3     4.0
.0025             .0025       8         Dg)     8.8     5.6     3.8     4.1     Bue,    By)     3)
(1/2, 1/2)
.05, .15          .1          320       4.5     5.6     5.6     4.2     4.5     Sh)     4.8     4.0
.01, .03          .02         64        3.4     7.0     Sal     4.8     4.5     4.9     4.1     4.6
.005, .015        .01         32        3.0     7.6     32      4.8     4.4     4.8     3.9     4.4
.00125, .00375    .0025       8         22,     92      5.4     4.3     3.8     4.3     3h)     4.3
(3/4, 1/4)
.05, .25          .1          320       3.3     Ue      4.8     5.6     35      5.1     Sal!    53
.01, .05          .02         64        P18)    8.1     Sal     5)!     4.1     53)     3.8     4.9
.005, .025        .01         32        PS)     9.2     4.9     5.6     339     5.6     3h)     D2
.00125, .00625    .0025       8         2.0     10.4    Ses     Shy)    eae     3.8     Syl     352)    Sill

(a) Fractions in parentheses are the probabilities that the cluster proportions have the stated value.


Table 4
Simulated Lack of Coverage (Percent) of Upper and Lower One-sided 95% Confidence Bounds for Sample Design of 32
Clusters and 10 Observations Per Cluster; Sample Weights are 1 or 10 with Probability 1/2
(Noninformative Sample Weights)

Distribution of   Overall     Expected  Method of calculating confidence bounds
cluster           proportion  number    Linear          Logit           Breeze          Proposed
proportions(a)                positive  Lower   Upper   Lower   Upper   Lower   Upper   Lower   Upper

(1)
.2                .2          64        4.0     6.6     S)      4.7     3a      4.7     4.2     4.3
.1                .1          32        312     7.8     5.3     4.4     3.6     3.8     3.9     4.0
.025              .025        8         1       10.2    20)     Paes    3.4     Del     Si      2.4
(1/2, 1/2)
.1, .3            .2          64        3.6     7.0     5.0     4.9     2.8     3.4     3.9     4.4
.05, .15          .1          32        3.0     8.1     5,      4.6     3.4     4.0     31.     4.2
.0125, .0375      .025        8         1.6     10.6    5.4     al      3.3     2.1     isi     pus)
(3/4, 1/4)
.1, .5            .2          64        oh      7.8     4.6     5.3     2.4     3.9     33      48
.05, .25          .1          32        DES)    O22     4.8     3)?     3.0     4.6     33      4.8
.0125, .0625      .025        8         hes}    ES      3       2.4     3.2     3.5     3.0     2.8

(a) Fractions in parentheses are the probabilities that the cluster proportions have the stated value.


Table 5
Simulated Lack of Coverage (Percent) of Upper and Lower One-sided 95% Confidence Bounds for Sample Design of 32
Clusters and 10 or 100 Observations Per Cluster with Probability 1/2; Sample Weights are 1 or 10 with Probability 1/2
(Noninformative Sample Weights)

Distribution of   Overall     Expected  Method of calculating confidence bounds
cluster           proportion  number    Linear          Logit           Breeze          Proposed
proportions(a)                positive  Lower   Upper   Lower   Upper   Lower   Upper   Lower   Upper

(1)
.1818             .1818       320       Sol     6.0     ay)     DP      4.2     4.1     Sy      5.0
.0364             .0364       64        4.1     7.6     Sa      3.7     5.0     Se      4.8     4.9
.0182             .0182       32        3.4     8.5     Sl)     5.0     4.7     Sal     4.4     4.7
.0045             .0045       8         2.0     WAT)    D9      3.4     4.0     4.3     3.6     3.8
(1/2, 1/2)
.0909, .2727      .1818       320       5.0     6.4     6.1     4.8     4.2     3.6     Se      4.4
.0182, .0545      .0364       64        3.9     8.1     6.0     Syl     4.9     5.0     4.7     4.8
.0091, .0273      .0182       32        all     2.3)    5.8     See     4.5     2)5)    4.2     49
.0023, .0068      .0045       8         1.8     113}    3.9     3.6     3.9     4.5     Be      4.0
(3/4, 1/4)
.0909, .4545      .1818       320       ail     O19     4.6     7.6     8)      6.3     3.3     al
.0182, .0909      .0364       64        2.8     10.9    aye)    thes}   3.9     Ie      Shi)    7.0
.0091, .0455      .0182       32        2.4     LES     5.4     6.8     3.9     6.9     3.6     6.5
.0023, .0114      .0045       8         1.6     14.5    Sy      4.0     ay)     5.0     35      4.4

(a) Fractions in parentheses are the probabilities that the cluster weighted proportions have the stated value.
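
The intracluster correlations quoted for Table 1 can be reproduced with a few lines of Python. The sketch assumes the correlation is the between-cluster variance of the cluster probabilities divided by $p(1 - p)$, ignoring the sample weights as in the text; the function name and arguments are hypothetical.

```python
def two_point_icc(p, low_mult, high_mult, prob_low):
    """Intracluster correlation when each cluster probability is low_mult * p
    (with probability prob_low) or high_mult * p (otherwise)."""
    p_low, p_high = low_mult * p, high_mult * p
    mean = prob_low * p_low + (1.0 - prob_low) * p_high
    var_between = (prob_low * (p_low - mean) ** 2
                   + (1.0 - prob_low) * (p_high - mean) ** 2)
    return var_between / (mean * (1.0 - mean))

# Middle third of Table 1: p/2 or 3p/2, each with probability 1/2
middle = [round(two_point_icc(p, 0.5, 1.5, 0.5), 4) for p in (0.1, 0.02, 0.01, 0.0025)]
# Bottom third of Table 1: p/2 with probability 3/4, 5p/2 with probability 1/4
bottom = [round(two_point_icc(p, 0.5, 2.5, 0.75), 4) for p in (0.1, 0.02, 0.01, 0.0025)]
```

Under this assumption the middle-third values come out as .0278, .0051, .0025 and .0006, and the bottom-third values as .0833, .0153, .0076 and .0019.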

The results presented in Table 1 are appropriate for one- 
sided 95% upper and lower confidence limits; ideally the 
lack-of-coverage percentages in the table should be less 
than or equal to the nominal value of 5.0. The results are 
also relevant for two-sided 90% confidence intervals, for
which ideally the upper and lower values in the table
should both be ≤ 5.0 (Jennings 1987). For each line of the
table, 100,000 datasets were simulated using the random 
number generator in SAS (1990, p. 631) to estimate the 
probabilities of noncoverage of the confidence limits. 
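
A single replicate of the Table 1 design can be sketched as follows. The variance estimator shown, a with-replacement linearization estimator that treats the 32 clusters as primary sampling units in one stratum, is an assumption of this sketch, since the estimator used in the simulations is not spelled out in this section. The function names are hypothetical, and Python's `random` module stands in for the SAS generator. The effective sample size is taken to be $\hat{p}(1 - \hat{p})$ divided by the estimated variance, which agrees with $n^* = n/\text{deff}$ as used in Section 4.2.

```python
import random

def simulate_dataset(cluster_probs, cluster_size, rng):
    """One simulated dataset: a list of clusters, each a list of (weight, y) pairs.
    Weights are 1 or 10 with probability 1/2 (noninformative)."""
    data = []
    for p_i in cluster_probs:
        cluster = []
        for _ in range(cluster_size):
            y = 1 if rng.random() < p_i else 0
            w = 10.0 if rng.random() < 0.5 else 1.0
            cluster.append((w, y))
        data.append(cluster)
    return data

def weighted_proportion_and_variance(data):
    """Weighted proportion and its linearization variance (clusters as PSUs)."""
    total_w = sum(w for cluster in data for w, _ in cluster)
    p_hat = sum(w * y for cluster in data for w, y in cluster) / total_w
    k = len(data)
    # Linearized cluster totals; they sum to zero by construction.
    z = [sum(w * (y - p_hat) for w, y in cluster) / total_w for cluster in data]
    var_hat = k / (k - 1) * sum(z_i * z_i for z_i in z)
    return p_hat, var_hat

def effective_sample_size(p_hat, var_hat):
    """Effective sample size p_hat (1 - p_hat) / var_hat."""
    return p_hat * (1.0 - p_hat) / var_hat

rng = random.Random(1)
data = simulate_dataset([0.1] * 32, 100, rng)
p_hat, var_hat = weighted_proportion_and_variance(data)
n_star = effective_sample_size(p_hat, var_hat)
```

With 100,000 replicates per table line, a noncoverage probability near 5% is estimated with a Monte Carlo standard error of about $\sqrt{.05 \times .95 / 100000} \approx .0007$, that is, roughly 0.07 percentage points.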

For the linear confidence bounds, the upper confidence
limit falls below the true value more often than the 5% nominal
level. Somewhat surprisingly, this is true even with an expected
count as large as 320 positive events, especially with positive
intracluster correlation (middle and bottom thirds of the
table). For the logit-transform confidence bounds, the
noncoverage appears slightly higher than the nominal level,
especially for the lower limits. Both the Breeze and proposed
confidence bounds appear generally conservative.
Simple-random-sampling binomial limits are not appropriate
for the cases simulated in Table 1 because of the
sample weights and the intracluster correlation (in the
bottom two-thirds of the table). This can be demonstrated
by noting that the lack of coverage for both the upper and
lower binomial bounds is greater than 8% for all the cases
considered in the table (results not shown).

As it is slightly complicated to discuss confidence 
interval “lengths” for one-sided bounds, we restrict 
discussion to the lengths of the two-sided 90% confidence 
intervals. Over all the simulations presented in Table 1, the 
Breeze and proposed intervals are 3.3% and 4.9% wider on 
average than the logit-transform intervals. 

Table 2 presents simulation results for the same setup as 
Table 1 except that the sample weights were taken to be 
informative. This was done by setting the sample weight to
be 10 with probability 2/3 if the event was positive and with
probability 1/3 if the event was not positive; otherwise the
weight was set to 1. The probability that an event was positive
in each cluster was adjusted downwards so that the overall
weighted proportions were the same as in Table 1. The
results in Table 2 look similar to those in Table 1 except the 
linear and logit intervals tend to have worse coverage 
probabilities. 
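
The downward adjustment of the event probability under informative weights can be illustrated as below. This is a sketch under an assumption: the adjusted probability $q$ is chosen so that the ratio of expected weighted positives to expected total weight equals the target proportion $p$. With the stated weight probabilities this gives $q = 4p/(7 - 3p)$, and it reproduces the expected numbers positive listed in Table 2 for the constant-probability cases. The function name and arguments are hypothetical.

```python
def adjusted_event_probability(p_target, prob10_pos=2/3, prob10_neg=1/3,
                               w_high=10.0, w_low=1.0):
    """Event probability q whose expected weighted proportion equals p_target."""
    mean_w_pos = prob10_pos * w_high + (1 - prob10_pos) * w_low   # 7 for the text's setup
    mean_w_neg = prob10_neg * w_high + (1 - prob10_neg) * w_low   # 4 for the text's setup
    # Solve q * mean_w_pos / (q * mean_w_pos + (1 - q) * mean_w_neg) = p_target for q.
    return p_target * mean_w_neg / (mean_w_pos - p_target * (mean_w_pos - mean_w_neg))

# Expected number of positive events out of 3200 observations
expected_positive = [round(3200 * adjusted_event_probability(p), 1)
                     for p in (0.1, 0.02, 0.01, 0.0025)]
```

Under this assumption the expected counts are 191.0, 36.9, 18.4 and 4.6, compared with 320, 64, 32 and 8 in Table 1.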

Table 3 presents simulation results for the same setup as
Table 1 except the analysis is unweighted. The results are
very similar to the Table 1 results. Since the top third of
Table 3 corresponds to no intracluster correlation, one
could also use the simple-random-sampling binomial limits
there. Averaging over the four situations in this third of the
table, the proposed limits are 2.5% wider than the binomial
limits (results not shown). As the true design effect is 1.0
in the top third of Table 3, these simulations can be used to
examine the effect of truncation of $n^*_{df}$ in the proposed
procedure. (Truncation is uncommon in the simulations in
Table 1, since the true design effects there are all >1.)
Simulations using the proposed procedure with truncation
lead to wider, more conservative intervals than the
untruncated proposed intervals in the top third of Table 3. Averaging
over the four situations considered, the proposed limits with
truncation are 4.0% wider than the proposed limits (results
not shown for truncated limits).

Table 4 presents simulation results for the same setup as 
Table 1 except 10 rather than 100 observations are 
simulated within each cluster. The results are very similar 
to Table 1 when one compares simulations with the same 
expected number of positive events. The one exception is 
the increased conservativeness of the Breeze intervals as 
compared to the proposed method. This is because the 
overall proportions are higher in Table 4 than in Table 1 for a
given expected number of positive events (since the sample 
size is smaller in Table 4). The Poisson intervals of Breeze 
do not work well with proportions that are not small. For 
example, we performed a simulation corresponding to the 
top third of Table 1 except that the overall proportion $p = .5$,
with an expected 1600 positive events. The
simulated lower and upper lack-of-coverage percentages for 
the Breeze bounds were 1.2% and 1.3%, compared to 4.6% 
and 4.7% for the proposed method. The Breeze intervals 
were on average 37% wider than the proposed intervals. 

The Breeze intervals also do not work well when the 
number of clusters is very small, since they do not account 
for degrees of freedom of the variance estimation. For 
example, we performed a simulation corresponding to the 
top third of Table 1 except that data from only 8 clusters 
were simulated (with 100 observations per cluster), and 
$p_i = .1$ so that the expected number of positive events was
80. The simulated lower and upper lack-of-coverage 
percentages for the Breeze bounds were 6.1% and 5.4%, 
compared to 4.7% and 4.0% for the proposed method. 

Table 5 presents simulation results for the same setup as 
Table 1 except the cluster sizes were taken to be 10 or 100 
with probability 1/2. The lack-of-coverage probabilities are 
larger than the nominal 5% in the bottom third of the table 
for all the methods. The logit intervals also do not behave 
as well as in Table 1 for the top two-thirds of the table.

An additional set of simulations was done in which two
clusters (each of sample size 50) were simulated from each
of 32 strata. The expected numbers of positive events were
taken as in Table 1, the weights were randomly set to 1 or 
10, and the probability of a positive event was taken to be 
different in the different strata to simulate an intracluster 
correlation. The results (not shown) were very similar to 
the results given in Table 1. 


4. APPLICATIONS 


In this section we consider two applications in which the 
numbers of positive counts are small. In the first applica- 
tion, involving estimating HIV positivity in an unselected 
population, the numbers of positive counts are small 
because the rates of HIV infection are small. In the second 
application, involving estimating whether individuals have 
ever used cocaine, the rates are not small but the numbers 
of positive counts are small because we restrict the analyses 
to relatively small subdomains. For both applications, 
SUDAAN (Shah, Barnwell and Bieler 1995) was used to 
calculate the (design-based) standard errors of the 
proportions, and the function “FINV” in SAS (1990, 
p. 547) was used to calculate the quantiles of the F 
distribution in (1.2). 


4.1 Seroprevalence of HIV Estimated From the 
Third National Health and Nutrition 
Examination Survey (NHANES III) 


NHANES III was a survey conducted in 1988-1994 of 
the civilian noninstitutionalized population ages 2 months 
or older of the United States (National Center for Health
Statistics 1994). An HIV test was performed on participating
individuals 18 years of age or older. McQuillan,
Khare, Karon, Schable and Vlahov (1997) studied the 
seroprevalence of HIV for individuals under the age of 60 
years and various subgroups, some of which are displayed 
in Table 6. Of the 11,202 individuals tested, 59 were 
infected. The estimated prevalence in Table 6, 0.32%, is far 
from the unweighted proportion, 0.53% = 59/11202, 
because the estimated prevalence is a weighted proportion 
utilizing the sample weights. Because the testing for HIV 
was anonymous, for these analyses the sample weights were 
derived from the original NHANES III sample weights of 
all individuals in the same stand (survey location), 
race/ethnicity group, sex, and age group (18-39 vs. 40-59) 
of the tested individual (M. Khare, personal communica- 
tion). The pseudo-design for variance estimation was the 
sampling of 2 pseudo-PSU’s from each of 23 strata (M. 
Khare, personal communication), which is not the pseudo- 
design typically used for NHANES III variance estimation. 

The linear 90% confidence intervals for prevalence for 
the various groups in Table 6 are shifted to the left and 
shorter than the other intervals, which are similar to each 
other. The proposed intervals are very slightly wider than 
the Breeze or logit intervals. The effective sample sizes 
calculated in (2.1) are markedly smaller than the sample 
sizes because of the design effects of the survey; the 
confidence intervals based on the truncated procedures will 
therefore be identical to the ones given in Table 6. The 
differences between $n^*$ and $n^*_{df}$ are relatively minor. For
this relatively rare outcome, the simulations given in section 
3 suggest that the Breeze and proposed confidence intervals 
may maintain their nominal 90% coverage probabilities 
better than the other intervals. 


Table 6
Seroprevalence of HIV Among Adults Aged 18-59 Years Based on the Third National Health and
Nutrition Examination Survey

                                          Sex                             Race/ethnicity
                          Total           Male            Female          White           Black           Mex.-Amer.

Sample size               11202           5142            6060            4128            DoS             3495
Number infected           59              44              15              9               38              12
Prevalence (%) ± SE       0.320 ± 0.076   0.519 ± 0.130   0.127 ± 0.053   0.203 ± 0.071   1.100 ± 0.247   0.368 ± 0.134
Effective sample size
  $n^*$                   5588            3056            4433            3976            7S              2039
  $n^*_{df}$              5148            2816            4084            3664            1640            1880
Linear 90% conf. int.     (0.19, 0.45)    (0.30, 0.74)    (0.04, 0.22)    (0.08, 0.33)    (0.68, 1.52)    (0.14, 0.60)
Logit 90% conf. int.      (0.21, 0.48)    (0.34, 0.80)    (0.06, 0.26)    (ONO 7)         (0.75, 1.62)    (0.20, 0.69)
Breeze 90% conf. int.     (0.21, 0.48)    (0.32, 0.79)    (0.05, 0.26)    (0.10, 0.37)    (0.73, 1.61)    (0.18, 0.68)
Proposed 90% conf. int.   (0.20, 0.48)    (0.32, 0.80)    (0.05, 0.26)    (0.10, 0.37)    (OPS)           (0.17, 0.69)


Table 7
“Ever Users” of Cocaine Among Adults Ages 12-44 Years Based on Individuals with 16 or More Years of Education
Sampled in Hispanic Health and Nutrition Examination Survey

                                                  Sex
                                Total             Male              Female

Sample size                     123               69                54
Ever-users                      13                10                3
Proportion (%) ± SE             11.6 ± 2.5        Veale cra)        7.0 ± 4.8
Effective sample size
  $n^*$                         167.1             105.0             28.2
  $n^*_{df}$                    132.8             84.4              22.9
Linear 90% confidence int.      (7.0, 16.2)       (8.0, 20.7)       (-1.9, 15.9)(a)
Logit 90% confidence int.       (CES 7al)         (9:2 129)         (1.9, 22.8)
Breeze 90% confidence int.      (Cea 20)          (8.582322)        (0.9, 24.8)
Proposed 90% confidence int.    (GATE)            (S522 1)          (0.9, 22.7)
Truncated Procedures
Linear 90% confidence int.      (6.3, 17.0)       (G5e22>2))        same as above
Logit 90% confidence int.       (7.2, 18.2)       (8.1, 24.1)       same as above
Breeze 90% confidence int.      C/E ESe10)        (7.7, 24.4)       same as above
Proposed 90% confidence int.    (PAs VI           (8.0, 23.2)       same as above

(a) In practice, this interval would be presented as (0, 15.9) since negative proportions are impossible.


4.2 Use of Cocaine Among College-educated 
Individuals Sampled in the Hispanic Health and 
Nutrition Examination Survey (HHANES) 


HHANES was a survey conducted in 1982-1983 of three 
Hispanic groups living in the United States (National 
Center for Health Statistics 1985). We restrict attention 
here to the Mexican-American sample. Individuals ages 
12-44 years were asked “About how old were you the first 
time you tried cocaine?”. The possible answers were the 
age of the individual (in years) when he first tried cocaine, 
a “never used” category, and a “don’t know” category. We 
consider estimating the proportion of “ever-users” among 
individuals who completed 16 or more years of education 
(for which there were no “don’t know” responses). 

There were 13 ever-users among 123 sampled individuals,
with the sample-weighted proportion being 11.6%
(Table 7). The design-based standard error, 2.5%, is
estimated with only 8 degrees of freedom since the
sampling design of HHANES can be approximated by the
sampling of 2 PSU’s from each of 8 strata (Kovar and
Johnson 1986). The effective sample sizes are
$n^* = 167.1$ and $n^*_{df} = 132.8$, which are both greater than
the sample size. This is because the estimated design effect
is .736, so that $n^* = 123/.736 = 167.1$. (The second factor
in (2.2) is 0.794.) Despite the stratification, we think that
the true design effect is greater than 1 for this survey
because of the clustering and the sample weighting. (The
design effect is estimated poorly because of the
limited degrees of freedom.) We therefore think that the
truncated procedures are reasonable for this application.
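
The arithmetic behind these effective sample sizes can be checked with a short Python sketch. It takes the rounded design effect (.736) and the second factor of (2.2) (0.794) from the text as inputs, so the adjusted value agrees with the tabulated 132.8 only up to rounding. The function name is hypothetical.

```python
def effective_sample_sizes(n, design_effect, df_factor):
    """Return (n_star, n_star_df, truncated n_star_df) for sample size n."""
    n_star = n / design_effect          # effective sample size
    n_star_df = n_star * df_factor      # degrees-of-freedom adjusted version
    truncated = min(n_star_df, n)       # truncation at the actual sample size
    return n_star, n_star_df, truncated

n_star, n_star_df, truncated = effective_sample_sizes(123, 0.736, 0.794)
```

Here `n_star` is about 167.1 and `n_star_df` about 132.7, both above the sample size of 123, so the truncated procedures use 123.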


Because of the limited degrees of freedom, and because 
the outcome is not rare, there are more differences between 
the logit, Breeze and proposed confidence intervals 
displayed in Table 7. Based on the simulations given in 
section 3, we recommend the proposed (truncated) 
confidence intervals. 

Our approach may appear slightly inconsistent for this 
survey in that we accept poorly-estimated effective sample 
sizes less than the sample size but truncate those greater. 
We believe that this is a reasonable conservative approach 
to use when it is thought that the true design effect is 
probably greater than 1. 


5. DISCUSSION 


Although the confidence intervals proposed here had 
adequate coverage probability for almost all the simulations 
performed, this is not guaranteed for all possible configurations
of the population, e.g., see the bottom third of
Table 5. An example with a more serious lack of coverage
can also easily be constructed: Suppose that the population
consists of clusters of size 100, and that 10% of the clusters
have all positive units and the remaining 90% have all zero
units. If we sample 10 clusters as a simple random sample,
and subsample all the units in the sampled clusters, then
35% (= $(1 - .1)^{10}$) of the time we will observe no positive
units in the sample size of 1000. In this situation, our
proposed intervals reduce to the usual binomial ones, so
that, e.g., the upper 95% confidence limit for the population
proportion is given by .003 (= $1 - .05^{1/1000}$). This implies that
the upper 95% confidence limit is less than the true value
of .10 at least 35% of the time, a serious undercoverage.
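
The two numbers in this example are easy to verify. The sketch below treats the 10 sampled clusters as independent draws, as the expression $(1 - .1)^{10}$ does, and uses the closed form of the one-sided exact binomial upper limit when no positive units are observed. The function names are hypothetical.

```python
def prob_no_positive_clusters(frac_positive_clusters, clusters_sampled):
    """Probability that none of the sampled clusters is an all-positive cluster."""
    return (1.0 - frac_positive_clusters) ** clusters_sampled

def upper_limit_zero_events(n, alpha=0.05):
    """One-sided upper 100(1 - alpha)% binomial limit when 0 of n units are positive."""
    return 1.0 - alpha ** (1.0 / n)

p_none = prob_no_positive_clusters(0.10, 10)     # about 0.35
upper_1000 = upper_limit_zero_events(1000)       # about 0.003
```

Since the upper limit of about .003 lies far below the true proportion of .10, every sample with no positive units fails to cover, which accounts for the 35% figure.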

It is possible in simple sampling situations to construct 
confidence intervals that are guaranteed to have at least 
their nominal coverage probability by considering all 
possible configurations of the population, and using a least- 
favorable configuration for the coverage probability. For 
the hypothetical single-stage cluster sample mentioned 
above, for example, an upper 95% confidence limit could 
be given by the binomial limit based on 0 positive units out 
of 10, i.e., .26 (= $1 - .05^{1/10}$). Such confidence intervals, which
can become computationally intensive to calculate, have 
been studied by Gross and Frankel (1991), who also suggest 
some less computationally intensive approximations. 
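
For reference, the least-favorable limit quoted above follows from the standard zero-count form of the exact binomial upper limit, applied to the 10 sampled clusters rather than the 1000 sampled units. With no positives among $n$ independent Bernoulli draws, the one-sided upper limit $p_U$ at level $\alpha$ solves:

```latex
\[
(1 - p_U)^{n} = \alpha
\quad\Longrightarrow\quad
p_U = 1 - \alpha^{1/n},
\qquad
n = 10,\ \alpha = .05:\quad
p_U = 1 - .05^{1/10} \approx .259 .
\]
```

Treating the clusters as the units therefore raises the upper limit from about .003 to about .26, which covers the true value of .10 in the hypothetical example.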

The advantages of our proposed intervals over such
approaches are (1) they are easy to calculate, (2) they
accommodate any complex sampling design, including
nonresponse and poststratification adjustments to the sample
weights, (3) they will generally maintain their nominal
coverage probability, (4) they will be less conservative than
intervals that are guaranteed to maintain their nominal
coverage probability for all population configurations, and
(5) they have better properties than the linear,
logit-transform or Breeze intervals. Conclusions (2) and (5)
are based on our simulation results, which of course do not
cover all possible situations. More research would be
useful in this regard.